Identify Key Drivers of Economic Performance:
Assess Variations Across Subgroups:
Generate Actionable Insights:
Develop Robust Predictive Models:
Policy Makers:
Financial Institutions and Investors:
Researchers and Economists:
General Public:
What are the main financial indicators that influence economic performance and stability?
How do the impacts of these indicators differ across regions and income groups?
Can we use a composite measure (e.g., Financial_Depth or Financial_Depth_log) to summarize the financial system’s development effectively?
What policy measures could enhance financial development and stability, based on the influence of these indicators?
In this phase, we aim to develop a deep understanding of the available data, assess its quality, and uncover initial insights. This will inform our subsequent data preparation and modeling steps.
Below is a list of the primary financial indicators that form the core of your analysis:
Domestic credit provided by the financial sector (% of GDP)
Indicator Code: FS.AST.DOMS.GD.ZS
Corresponding Meta Data File: Metadata_Country_API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv (Part 4)
Domestic credit to private sector by banks (% of GDP)
Indicator Code: FD.AST.PRVT.GD.ZS
Corresponding Meta Data File: Metadata_Country_API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv (Part 2)
Domestic credit to private sector (% of GDP)
Indicator Code: FS.AST.PRVT.GD.ZS
Corresponding Meta Data File: Metadata_Country_API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv (Final meta data file)
Broad money (% of GDP)
Indicator Code: FM.LBL.BMNY.GD.ZS
Corresponding Meta Data File: Metadata_Country_API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv (Part 3)
Foreign direct investment, net inflows (% of GDP)
Indicator Code: BX.KLT.DINV.WD.GD.ZS
Corresponding Meta Data File: Metadata_Indicator_API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv
Bank nonperforming loans to total gross loans (%)
Indicator Code: FB.AST.NPER.ZS
Corresponding Meta Data File: Metadata_Indicator_API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv
!pip install xgboost
Requirement already satisfied: xgboost in c:\programdata\anaconda3\lib\site-packages (3.0.0) Requirement already satisfied: numpy in c:\programdata\anaconda3\lib\site-packages (from xgboost) (1.24.3) Requirement already satisfied: scipy in c:\programdata\anaconda3\lib\site-packages (from xgboost) (1.11.1)
!pip install shap
Requirement already satisfied: shap in c:\programdata\anaconda3\lib\site-packages (0.47.1) Requirement already satisfied: numpy in c:\programdata\anaconda3\lib\site-packages (from shap) (1.24.3) Requirement already satisfied: scipy in c:\programdata\anaconda3\lib\site-packages (from shap) (1.11.1) Requirement already satisfied: scikit-learn in c:\programdata\anaconda3\lib\site-packages (from shap) (1.3.0) Requirement already satisfied: pandas in c:\programdata\anaconda3\lib\site-packages (from shap) (2.0.3) Requirement already satisfied: tqdm>=4.27.0 in c:\programdata\anaconda3\lib\site-packages (from shap) (4.65.0) Requirement already satisfied: packaging>20.9 in c:\programdata\anaconda3\lib\site-packages (from shap) (23.1) Requirement already satisfied: slicer==0.0.8 in c:\programdata\anaconda3\lib\site-packages (from shap) (0.0.8) Requirement already satisfied: numba>=0.54 in c:\programdata\anaconda3\lib\site-packages (from shap) (0.57.1) Requirement already satisfied: cloudpickle in c:\programdata\anaconda3\lib\site-packages (from shap) (2.2.1) Requirement already satisfied: typing-extensions in c:\programdata\anaconda3\lib\site-packages (from shap) (4.7.1) Requirement already satisfied: llvmlite<0.41,>=0.40.0dev0 in c:\programdata\anaconda3\lib\site-packages (from numba>=0.54->shap) (0.40.0) Requirement already satisfied: colorama in c:\programdata\anaconda3\lib\site-packages (from tqdm>=4.27.0->shap) (0.4.6) Requirement already satisfied: python-dateutil>=2.8.2 in c:\programdata\anaconda3\lib\site-packages (from pandas->shap) (2.8.2) Requirement already satisfied: pytz>=2020.1 in c:\programdata\anaconda3\lib\site-packages (from pandas->shap) (2023.3.post1) Requirement already satisfied: tzdata>=2022.1 in c:\programdata\anaconda3\lib\site-packages (from pandas->shap) (2023.3) Requirement already satisfied: joblib>=1.1.1 in c:\programdata\anaconda3\lib\site-packages (from scikit-learn->shap) (1.2.0) Requirement already satisfied: threadpoolctl>=2.0.0 in c:\programdata\anaconda3\lib\site-packages (from scikit-learn->shap) (2.2.0) Requirement already satisfied: six>=1.5 in c:\programdata\anaconda3\lib\site-packages (from python-dateutil>=2.8.2->pandas->shap) (1.16.0)
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore
print("pandas version:", pd.__version__)
print("numpy version:", np.__version__)
print("scikit-learn version:", sklearn.__version__)
print("matplotlib version:", plt.matplotlib.__version__)
print("seaborn version:", sns.__version__)
import scipy
print("scipy version:", scipy.__version__)
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
import shap
from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt
pandas version: 2.0.3 numpy version: 1.24.3 scikit-learn version: 1.3.0 matplotlib version: 3.7.2 seaborn version: 0.12.2 scipy version: 1.11.1
# List of all file names with the './data/' prefix
def load_csv_with_skip(file, skip_rows=4, **kwargs):
"""
Attempts to load a CSV file while skipping the first 'skip_rows' lines.
"""
try:
df = pd.read_csv(file, encoding='latin1', skiprows=skip_rows, **kwargs)
return df
except Exception as e:
print(f"Error loading {file}: {e}")
return None
# Define file lists for each category
primary_files = [
"./data/API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv",
"./data/API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv",
"./data/API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv",
"./data/API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv",
"./data/API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv",
"./data/API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv"
]
country_meta_files = [
"./data/Metadata_Country_API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv",
"./data/Metadata_Country_API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv",
"./data/Metadata_Country_API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv",
"./data/Metadata_Country_API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv",
"./data/Metadata_Country_API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv",
"./data/Metadata_Country_API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv"
]
indicator_meta_files = [
"./data/Metadata_Indicator_API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv",
"./data/Metadata_Indicator_API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv",
"./data/Metadata_Indicator_API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv",
"./data/Metadata_Indicator_API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv",
"./data/Metadata_Indicator_API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv",
"./data/Metadata_Indicator_API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv"
]
# Dictionaries to store each group's DataFrames
primary_data = {}
country_meta_data = {}
indicator_meta_data = {}
def load_and_inspect(files, group_name, skip_rows=4):
"""
Loads each CSV file in 'files' using a skiprows parameter.
Prints head, a random sample, info, and describe outputs.
"""
print(f"\n==== Loading {group_name} Files ====\n")
for file in files:
try:
# Use skip_rows only for the primary files; meta data files usually load without it.
df = pd.read_csv(file, encoding='latin1', skiprows=skip_rows) if group_name == "Primary Financial Indicators" else pd.read_csv(file, encoding='latin1')
print(f"Loaded file: {file}")
print("-" * 60)
print("Head:")
print(df.head(), "\n")
# If the DataFrame has less than 5 rows, take all rows; otherwise sample 5.
sample_n = min(5, len(df))
print(f"Random Sample ({sample_n} rows):")
print(df.sample(sample_n, random_state=42), "\n")
print("Info:")
df.info()
print("\n")
print("Describe:")
print(df.describe(include='all'), "\n")
# Verify presence of key identifiers
if 'Country Code' in df.columns:
print("Key identifier 'Country Code' is present.\n")
elif 'country_code' in df.columns:
print("Key identifier 'country_code' is present.\n")
else:
print("Warning: No key identifier found in this file.\n")
print("=" * 80, "\n")
yield file, df
except Exception as e:
print(f"Error loading {file}: {e}\n")
# Load Primary Financial Indicators files (using skiprows=4 for these)
for f, d in load_and_inspect(primary_files, "Primary Financial Indicators", skip_rows=4):
primary_data[f] = d
==== Loading Primary Financial Indicators Files ====
Loaded file: ./data/API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv
------------------------------------------------------------
Head:
Country Name Country Code \
0 Aruba ABW
1 Africa Eastern and Southern AFE
2 Afghanistan AFG
3 Africa Western and Central AFW
4 Angola AGO
Indicator Name Indicator Code \
0 Foreign direct investment, net inflows (% of GDP) BX.KLT.DINV.WD.GD.ZS
1 Foreign direct investment, net inflows (% of GDP) BX.KLT.DINV.WD.GD.ZS
2 Foreign direct investment, net inflows (% of GDP) BX.KLT.DINV.WD.GD.ZS
3 Foreign direct investment, net inflows (% of GDP) BX.KLT.DINV.WD.GD.ZS
4 Foreign direct investment, net inflows (% of GDP) BX.KLT.DINV.WD.GD.ZS
1960 1961 1962 1963 1964 1965 ... 2015 2016 2017 \
0 NaN NaN NaN NaN NaN NaN ... -0.907264 0.923398 5.253373
1 NaN NaN NaN NaN NaN NaN ... 3.169598 1.952934 0.991785
2 NaN NaN NaN NaN NaN NaN ... 0.884000 0.516606 0.274797
3 NaN NaN NaN NaN NaN NaN ... 1.956009 1.827298 1.655975
4 NaN NaN NaN NaN NaN NaN ... 11.081339 -0.340243 -10.038377
2018 2019 2020 2021 2022 2023 Unnamed: 68
0 4.399822 -2.215283 6.489776 4.558715 7.567072 2.506331 NaN
1 1.272290 1.424519 1.361762 5.012059 1.695914 1.760090 NaN
2 0.661572 0.124496 0.064994 0.215925 NaN NaN NaN
3 1.052949 1.228532 1.147178 1.778296 1.204459 0.836211 NaN
4 -8.125891 -5.780813 -3.848264 -6.548542 -6.320564 -2.498840 NaN
[5 rows x 69 columns]
Random Sample (5 rows):
Country Name Country Code \
181 OECD members OED
119 Japan JPN
139 Lower middle income LMC
216 South Sudan SSD
45 Colombia COL
Indicator Name Indicator Code \
181 Foreign direct investment, net inflows (% of GDP) BX.KLT.DINV.WD.GD.ZS
119 Foreign direct investment, net inflows (% of GDP) BX.KLT.DINV.WD.GD.ZS
139 Foreign direct investment, net inflows (% of GDP) BX.KLT.DINV.WD.GD.ZS
216 Foreign direct investment, net inflows (% of GDP) BX.KLT.DINV.WD.GD.ZS
45 Foreign direct investment, net inflows (% of GDP) BX.KLT.DINV.WD.GD.ZS
1960 1961 1962 1963 1964 1965 ... 2015 2016 2017 \
181 NaN NaN NaN NaN NaN NaN ... 3.564339 3.838915 2.723947
119 NaN NaN NaN NaN NaN NaN ... 0.118162 0.818482 0.381320
139 NaN NaN NaN NaN NaN NaN ... 2.393749 2.106233 1.846860
216 NaN NaN NaN NaN NaN NaN ... 0.001250 NaN NaN
45 NaN NaN NaN NaN NaN NaN ... 3.959399 4.901646 4.393182
2018 2019 2020 2021 2022 2023 Unnamed: 68
181 0.624485 1.890481 1.164793 2.022343 1.579440 0.157191 NaN
119 0.501685 0.780785 1.237932 0.695726 1.127819 0.475297 NaN
139 1.732331 1.767694 1.883258 1.553918 1.513356 1.226273 NaN
216 NaN NaN NaN NaN NaN NaN NaN
45 3.380836 4.330582 2.758894 3.001757 4.975683 4.716592 NaN
[5 rows x 69 columns]
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 69 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country Name 266 non-null object
1 Country Code 266 non-null object
2 Indicator Name 266 non-null object
3 Indicator Code 266 non-null object
4 1960 0 non-null float64
5 1961 0 non-null float64
6 1962 0 non-null float64
7 1963 0 non-null float64
8 1964 0 non-null float64
9 1965 0 non-null float64
10 1966 0 non-null float64
11 1967 0 non-null float64
12 1968 0 non-null float64
13 1969 0 non-null float64
14 1970 146 non-null float64
15 1971 148 non-null float64
16 1972 157 non-null float64
17 1973 160 non-null float64
18 1974 163 non-null float64
19 1975 167 non-null float64
20 1976 166 non-null float64
21 1977 172 non-null float64
22 1978 174 non-null float64
23 1979 180 non-null float64
24 1980 186 non-null float64
25 1981 189 non-null float64
26 1982 192 non-null float64
27 1983 193 non-null float64
28 1984 194 non-null float64
29 1985 198 non-null float64
30 1986 198 non-null float64
31 1987 198 non-null float64
32 1988 198 non-null float64
33 1989 199 non-null float64
34 1990 164 non-null float64
35 1991 173 non-null float64
36 1992 180 non-null float64
37 1993 196 non-null float64
38 1994 200 non-null float64
39 1995 207 non-null float64
40 1996 209 non-null float64
41 1997 211 non-null float64
42 1998 211 non-null float64
43 1999 210 non-null float64
44 2000 210 non-null float64
45 2001 215 non-null float64
46 2002 219 non-null float64
47 2003 221 non-null float64
48 2004 221 non-null float64
49 2005 229 non-null float64
50 2006 231 non-null float64
51 2007 234 non-null float64
52 2008 235 non-null float64
53 2009 234 non-null float64
54 2010 235 non-null float64
55 2011 236 non-null float64
56 2012 236 non-null float64
57 2013 235 non-null float64
58 2014 240 non-null float64
59 2015 239 non-null float64
60 2016 238 non-null float64
61 2017 238 non-null float64
62 2018 237 non-null float64
63 2019 236 non-null float64
64 2020 233 non-null float64
65 2021 233 non-null float64
66 2022 231 non-null float64
67 2023 222 non-null float64
68 Unnamed: 68 0 non-null float64
dtypes: float64(65), object(4)
memory usage: 143.5+ KB
Describe:
Country Name Country Code \
count 266 266
unique 266 266
top Aruba ABW
freq 1 1
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
Indicator Name \
count 266
unique 1
top Foreign direct investment, net inflows (% of GDP)
freq 266
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Indicator Code 1960 1961 1962 1963 1964 1965 ... \
count 266 0.0 0.0 0.0 0.0 0.0 0.0 ...
unique 1 NaN NaN NaN NaN NaN NaN ...
top BX.KLT.DINV.WD.GD.ZS NaN NaN NaN NaN NaN NaN ...
freq 266 NaN NaN NaN NaN NaN NaN ...
mean NaN NaN NaN NaN NaN NaN NaN ...
std NaN NaN NaN NaN NaN NaN NaN ...
min NaN NaN NaN NaN NaN NaN NaN ...
25% NaN NaN NaN NaN NaN NaN NaN ...
50% NaN NaN NaN NaN NaN NaN NaN ...
75% NaN NaN NaN NaN NaN NaN NaN ...
max NaN NaN NaN NaN NaN NaN NaN ...
2015 2016 2017 2018 2019 \
count 239.000000 238.000000 238.000000 237.000000 236.000000
unique NaN NaN NaN NaN NaN
top NaN NaN NaN NaN NaN
freq NaN NaN NaN NaN NaN
mean 16.941509 8.454749 7.862540 -0.698425 6.355253
std 127.982377 63.274588 43.846069 90.086652 25.185659
min -7.418557 -37.172653 -28.307232 -1303.108267 -13.674036
25% 1.211047 1.204552 1.358597 0.977667 1.206220
50% 2.610534 2.659908 2.580188 2.142048 2.282700
75% 4.846196 4.880496 4.499676 3.804866 4.207715
max 1509.002098 972.693539 602.374526 452.221040 252.920356
2020 2021 2022 2023 Unnamed: 68
count 233.000000 233.000000 231.000000 222.000000 0.0
unique NaN NaN NaN NaN NaN
top NaN NaN NaN NaN NaN
freq NaN NaN NaN NaN NaN
mean 4.440260 6.302615 2.199459 3.709134 NaN
std 22.693943 29.270491 25.989672 21.411557 NaN
min -101.833102 -14.409179 -360.352807 -34.010361 NaN
25% 0.634851 1.227174 1.086839 0.435969 NaN
50% 1.726378 2.640303 2.177450 1.467841 NaN
75% 3.396835 4.558715 4.162669 3.122306 NaN
max 200.831508 433.860897 126.275788 286.040068 NaN
[11 rows x 69 columns]
Key identifier 'Country Code' is present.
================================================================================
Loaded file: ./data/API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv
------------------------------------------------------------
Head:
Country Name Country Code \
0 Aruba ABW
1 Africa Eastern and Southern AFE
2 Afghanistan AFG
3 Africa Western and Central AFW
4 Angola AGO
Indicator Name Indicator Code 1960 \
0 Bank nonperforming loans to total gross loans (%) FB.AST.NPER.ZS NaN
1 Bank nonperforming loans to total gross loans (%) FB.AST.NPER.ZS NaN
2 Bank nonperforming loans to total gross loans (%) FB.AST.NPER.ZS NaN
3 Bank nonperforming loans to total gross loans (%) FB.AST.NPER.ZS NaN
4 Bank nonperforming loans to total gross loans (%) FB.AST.NPER.ZS NaN
1961 1962 1963 1964 1965 ... 2015 2016 2017 \
0 NaN NaN NaN NaN NaN ... NaN NaN NaN
1 NaN NaN NaN NaN NaN ... NaN NaN NaN
2 NaN NaN NaN NaN NaN ... 12.053618 11.074379 12.201830
3 NaN NaN NaN NaN NaN ... NaN NaN NaN
4 NaN NaN NaN NaN NaN ... 10.611234 11.285086 25.836061
2018 2019 2020 2021 2022 2023 Unnamed: 68
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 8.886660 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 23.221737 23.145469 12.223811 14.996088 NaN NaN NaN
[5 rows x 69 columns]
Random Sample (5 rows):
Country Name Country Code \
181 OECD members OED
119 Japan JPN
139 Lower middle income LMC
216 South Sudan SSD
45 Colombia COL
Indicator Name Indicator Code 1960 \
181 Bank nonperforming loans to total gross loans (%) FB.AST.NPER.ZS NaN
119 Bank nonperforming loans to total gross loans (%) FB.AST.NPER.ZS NaN
139 Bank nonperforming loans to total gross loans (%) FB.AST.NPER.ZS NaN
216 Bank nonperforming loans to total gross loans (%) FB.AST.NPER.ZS NaN
45 Bank nonperforming loans to total gross loans (%) FB.AST.NPER.ZS NaN
1961 1962 1963 1964 1965 ... 2015 2016 2017 \
181 NaN NaN NaN NaN NaN ... NaN NaN NaN
119 NaN NaN NaN NaN NaN ... 1.533600 1.396770 1.191848
139 NaN NaN NaN NaN NaN ... NaN NaN NaN
216 NaN NaN NaN NaN NaN ... NaN NaN NaN
45 NaN NaN NaN NaN NaN ... 2.847673 3.124537 4.179781
2018 2019 2020 2021 2022 2023 Unnamed: 68
181 NaN NaN NaN NaN NaN NaN NaN
119 1.073387 1.111436 1.138979 1.227519 1.234010 NaN NaN
139 NaN NaN NaN NaN NaN NaN NaN
216 NaN NaN NaN NaN NaN NaN NaN
45 3.427637 3.149631 3.674749 3.034032 2.467973 NaN NaN
[5 rows x 69 columns]
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 69 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country Name 266 non-null object
1 Country Code 266 non-null object
2 Indicator Name 266 non-null object
3 Indicator Code 266 non-null object
4 1960 0 non-null float64
5 1961 0 non-null float64
6 1962 0 non-null float64
7 1963 0 non-null float64
8 1964 0 non-null float64
9 1965 0 non-null float64
10 1966 0 non-null float64
11 1967 0 non-null float64
12 1968 0 non-null float64
13 1969 0 non-null float64
14 1970 0 non-null float64
15 1971 0 non-null float64
16 1972 0 non-null float64
17 1973 0 non-null float64
18 1974 0 non-null float64
19 1975 0 non-null float64
20 1976 0 non-null float64
21 1977 0 non-null float64
22 1978 0 non-null float64
23 1979 0 non-null float64
24 1980 0 non-null float64
25 1981 0 non-null float64
26 1982 0 non-null float64
27 1983 0 non-null float64
28 1984 0 non-null float64
29 1985 0 non-null float64
30 1986 0 non-null float64
31 1987 0 non-null float64
32 1988 0 non-null float64
33 1989 0 non-null float64
34 1990 0 non-null float64
35 1991 0 non-null float64
36 1992 0 non-null float64
37 1993 0 non-null float64
38 1994 0 non-null float64
39 1995 0 non-null float64
40 1996 0 non-null float64
41 1997 0 non-null float64
42 1998 0 non-null float64
43 1999 0 non-null float64
44 2000 2 non-null float64
45 2001 3 non-null float64
46 2002 3 non-null float64
47 2003 5 non-null float64
48 2004 5 non-null float64
49 2005 30 non-null float64
50 2006 36 non-null float64
51 2007 42 non-null float64
52 2008 74 non-null float64
53 2009 86 non-null float64
54 2010 113 non-null float64
55 2011 116 non-null float64
56 2012 122 non-null float64
57 2013 125 non-null float64
58 2014 124 non-null float64
59 2015 136 non-null float64
60 2016 140 non-null float64
61 2017 144 non-null float64
62 2018 147 non-null float64
63 2019 144 non-null float64
64 2020 140 non-null float64
65 2021 138 non-null float64
66 2022 125 non-null float64
67 2023 0 non-null float64
68 Unnamed: 68 0 non-null float64
dtypes: float64(65), object(4)
memory usage: 143.5+ KB
Describe:
Country Name Country Code \
count 266 266
unique 266 266
top Aruba ABW
freq 1 1
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
Indicator Name Indicator Code \
count 266 266
unique 1 1
top Bank nonperforming loans to total gross loans (%) FB.AST.NPER.ZS
freq 266 266
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
1960 1961 1962 1963 1964 1965 ... 2015 2016 \
count 0.0 0.0 0.0 0.0 0.0 0.0 ... 136.000000 140.000000
unique NaN NaN NaN NaN NaN NaN ... NaN NaN
top NaN NaN NaN NaN NaN NaN ... NaN NaN
freq NaN NaN NaN NaN NaN NaN ... NaN NaN
mean NaN NaN NaN NaN NaN NaN ... 7.730011 7.849815
std NaN NaN NaN NaN NaN NaN ... 8.420498 8.647050
min NaN NaN NaN NaN NaN NaN ... 0.119844 0.195619
25% NaN NaN NaN NaN NaN NaN ... 2.358006 2.213478
50% NaN NaN NaN NaN NaN NaN ... 4.981650 4.773126
75% NaN NaN NaN NaN NaN NaN ... 9.925983 10.116777
max NaN NaN NaN NaN NaN NaN ... 49.105725 47.595850
2017 2018 2019 2020 2021 \
count 144.000000 147.000000 144.000000 140.000000 138.000000
unique NaN NaN NaN NaN NaN
top NaN NaN NaN NaN NaN
freq NaN NaN NaN NaN NaN
mean 7.855498 7.396732 6.930146 7.106611 6.077451
std 9.308362 9.006356 9.146817 8.986869 8.004403
min 0.227487 0.217152 0.239743 0.128030 0.233335
25% 2.314672 2.237032 1.969493 2.176189 1.915288
50% 4.173785 3.686425 3.925993 4.040406 3.818139
75% 10.491818 9.603857 8.290779 7.787869 6.803305
max 54.823276 54.410017 58.930156 61.116041 58.961833
2022 2023 Unnamed: 68
count 125.000000 0.0 0.0
unique NaN NaN NaN
top NaN NaN NaN
freq NaN NaN NaN
mean 5.457120 NaN NaN
std 8.075752 NaN NaN
min 0.242167 NaN NaN
25% 1.802530 NaN NaN
50% 3.094013 NaN NaN
75% 5.519875 NaN NaN
max 55.410965 NaN NaN
[11 rows x 69 columns]
Key identifier 'Country Code' is present.
================================================================================
Loaded file: ./data/API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv
------------------------------------------------------------
Head:
Country Name Country Code \
0 Aruba ABW
1 Africa Eastern and Southern AFE
2 Afghanistan AFG
3 Africa Western and Central AFW
4 Angola AGO
Indicator Name Indicator Code \
0 Domestic credit to private sector by banks (% ... FD.AST.PRVT.GD.ZS
1 Domestic credit to private sector by banks (% ... FD.AST.PRVT.GD.ZS
2 Domestic credit to private sector by banks (% ... FD.AST.PRVT.GD.ZS
3 Domestic credit to private sector by banks (% ... FD.AST.PRVT.GD.ZS
4 Domestic credit to private sector by banks (% ... FD.AST.PRVT.GD.ZS
1960 1961 1962 1963 1964 1965 ... 2015 \
0 NaN NaN NaN NaN NaN NaN ... 56.626871
1 NaN NaN NaN NaN NaN 21.338823 ... 39.243754
2 NaN NaN NaN NaN NaN NaN ... 3.789807
3 6.15642 6.565687 7.206169 8.092604 8.47918 8.497916 ... 14.444677
4 NaN NaN NaN NaN NaN NaN ... 24.010776
2016 2017 2018 2019 2020 2021 \
0 57.221975 57.246514 55.965159 58.785552 81.748704 67.508471
1 38.851798 37.002628 36.910636 36.944584 37.407390 35.855730
2 3.775813 3.710381 3.442127 3.173545 2.994213 NaN
3 15.851837 14.886510 13.245001 13.212838 13.858455 14.663539
4 19.940831 15.920022 13.832310 14.256938 12.426118 9.279728
2022 2023 Unnamed: 68
0 61.356554 59.605470 NaN
1 33.341158 34.486264 NaN
2 NaN NaN NaN
3 15.193633 18.179781 NaN
4 7.536387 8.251449 NaN
[5 rows x 69 columns]
Random Sample (5 rows):
Country Name Country Code \
181 OECD members OED
119 Japan JPN
139 Lower middle income LMC
216 South Sudan SSD
45 Colombia COL
Indicator Name Indicator Code \
181 Domestic credit to private sector by banks (% ... FD.AST.PRVT.GD.ZS
119 Domestic credit to private sector by banks (% ... FD.AST.PRVT.GD.ZS
139 Domestic credit to private sector by banks (% ... FD.AST.PRVT.GD.ZS
216 Domestic credit to private sector by banks (% ... FD.AST.PRVT.GD.ZS
45 Domestic credit to private sector by banks (% ... FD.AST.PRVT.GD.ZS
1960 1961 1962 1963 1964 1965 ... \
181 35.600830 36.958438 39.534703 42.364015 43.565090 45.332785 ...
119 52.615776 53.157657 69.440322 76.328003 73.604740 76.437461 ...
139 8.775518 9.501488 9.848606 10.278243 10.256443 10.671527 ...
216 NaN NaN NaN NaN NaN NaN ...
45 20.540852 21.761283 22.468362 21.417330 19.051233 19.094696 ...
2015 2016 2017 2018 2019 2020 \
181 79.669484 80.275758 78.754733 78.539219 78.379325 83.875319
119 102.352703 103.966937 106.393369 106.939815 110.662583 120.933120
139 41.347454 42.412451 42.952315 43.254293 43.734788 47.250046
216 2.210693 4.370372 1.690735 1.518150 1.697267 1.918293
45 46.885140 47.035034 49.787316 45.856833 46.024506 51.978423
2021 2022 2023 Unnamed: 68
181 80.215862 76.528575 72.359684 NaN
119 120.710410 123.704557 122.596197 NaN
139 45.991398 NaN NaN NaN
216 2.696044 3.088838 5.174500 NaN
45 47.893131 44.210940 41.960095 NaN
[5 rows x 69 columns]
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 69 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country Name 266 non-null object
1 Country Code 266 non-null object
2 Indicator Name 266 non-null object
3 Indicator Code 266 non-null object
4 1960 87 non-null float64
5 1961 103 non-null float64
6 1962 107 non-null float64
7 1963 115 non-null float64
8 1964 115 non-null float64
9 1965 122 non-null float64
10 1966 125 non-null float64
11 1967 129 non-null float64
12 1968 128 non-null float64
13 1969 131 non-null float64
14 1970 133 non-null float64
15 1971 135 non-null float64
16 1972 137 non-null float64
17 1973 141 non-null float64
18 1974 141 non-null float64
19 1975 142 non-null float64
20 1976 145 non-null float64
21 1977 147 non-null float64
22 1978 146 non-null float64
23 1979 147 non-null float64
24 1980 152 non-null float64
25 1981 156 non-null float64
26 1982 158 non-null float64
27 1983 158 non-null float64
28 1984 160 non-null float64
29 1985 165 non-null float64
30 1986 165 non-null float64
31 1987 165 non-null float64
32 1988 165 non-null float64
33 1989 166 non-null float64
34 1990 164 non-null float64
35 1991 168 non-null float64
36 1992 178 non-null float64
37 1993 184 non-null float64
38 1994 185 non-null float64
39 1995 192 non-null float64
40 1996 193 non-null float64
41 1997 195 non-null float64
42 1998 197 non-null float64
43 1999 199 non-null float64
44 2000 199 non-null float64
45 2001 218 non-null float64
46 2002 220 non-null float64
47 2003 220 non-null float64
48 2004 223 non-null float64
49 2005 225 non-null float64
50 2006 226 non-null float64
51 2007 226 non-null float64
52 2008 227 non-null float64
53 2009 226 non-null float64
54 2010 227 non-null float64
55 2011 226 non-null float64
56 2012 227 non-null float64
57 2013 229 non-null float64
58 2014 228 non-null float64
59 2015 224 non-null float64
60 2016 223 non-null float64
61 2017 220 non-null float64
62 2018 217 non-null float64
63 2019 218 non-null float64
64 2020 215 non-null float64
65 2021 211 non-null float64
66 2022 200 non-null float64
67 2023 192 non-null float64
68 Unnamed: 68 0 non-null float64
dtypes: float64(65), object(4)
memory usage: 143.5+ KB
Describe:
Country Name Country Code \
count 266 266
unique 266 266
top Aruba ABW
freq 1 1
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
Indicator Name Indicator Code \
count 266 266
unique 1 1
top Domestic credit to private sector by banks (% ... FD.AST.PRVT.GD.ZS
freq 266 266
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
1960 1961 1962 1963 1964 1965 \
count 87.000000 103.000000 107.000000 115.000000 115.000000 122.000000
unique NaN NaN NaN NaN NaN NaN
top NaN NaN NaN NaN NaN NaN
freq NaN NaN NaN NaN NaN NaN
mean 15.475952 14.813447 15.445510 15.343547 15.567762 15.863615
std 13.448436 12.575928 13.392090 13.876734 13.766606 13.483375
min 0.544827 0.466949 0.453731 1.167678 0.938462 1.102151
25% 7.932804 8.072336 8.395453 7.432353 7.985473 8.598813
50% 11.022740 11.556590 12.329079 11.768265 11.798126 13.577323
75% 18.790493 16.465824 15.914307 16.707705 17.216308 17.765593
max 87.788473 88.665218 89.201134 89.098030 87.074196 88.176344
... 2015 2016 2017 2018 2019 \
count ... 224.000000 223.000000 220.000000 217.000000 218.000000
unique ... NaN NaN NaN NaN NaN
top ... NaN NaN NaN NaN NaN
freq ... NaN NaN NaN NaN NaN
mean ... 53.520044 53.782738 52.250055 51.420226 51.742856
std ... 38.920160 38.853394 37.992443 37.449592 38.283909
min ... 0.003335 0.003476 0.003207 0.003575 0.003808
25% ... 24.765330 24.819604 23.087517 22.606426 22.853363
50% ... 46.412046 44.699403 44.511795 44.913606 45.221871
75% ... 70.826424 70.264963 69.493819 69.038212 69.045210
max ... 243.304880 216.137571 223.371866 219.908123 236.740072
2020 2021 2022 2023 Unnamed: 68
count 215.000000 211.000000 200.000000 192.000000 0.0
unique NaN NaN NaN NaN NaN
top NaN NaN NaN NaN NaN
freq NaN NaN NaN NaN NaN
mean 57.685065 55.097171 54.506058 53.588065 NaN
std 44.552341 42.852191 44.970576 42.122318 NaN
min 0.003573 0.004205 0.003524 0.003224 NaN
25% 24.971330 23.749030 22.574429 24.361344 NaN
50% 49.612468 47.126111 44.510475 42.568348 NaN
75% 79.699249 71.500547 70.587313 68.377382 NaN
max 269.161635 259.183988 286.557423 248.758613 NaN
[11 rows x 69 columns]
Key identifier 'Country Code' is present.
================================================================================
Loaded file: ./data/API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv
------------------------------------------------------------
Head:
Country Name Country Code Indicator Name \
0 Aruba ABW Broad money (% of GDP)
1 Africa Eastern and Southern AFE Broad money (% of GDP)
2 Afghanistan AFG Broad money (% of GDP)
3 Africa Western and Central AFW Broad money (% of GDP)
4 Angola AGO Broad money (% of GDP)
Indicator Code 1960 1961 1962 1963 1964 \
0 FM.LBL.BMNY.GD.ZS NaN NaN NaN NaN NaN
1 FM.LBL.BMNY.GD.ZS NaN NaN NaN NaN NaN
2 FM.LBL.BMNY.GD.ZS NaN NaN NaN NaN NaN
3 FM.LBL.BMNY.GD.ZS 11.273418 11.439409 11.752169 11.783589 12.887036
4 FM.LBL.BMNY.GD.ZS NaN NaN NaN NaN NaN
1965 ... 2015 2016 2017 2018 2019 \
0 NaN ... 71.741782 78.012143 76.595426 74.635468 75.173399
1 33.646623 ... 47.348578 46.741995 45.364956 46.398586 49.025736
2 NaN ... 33.928053 37.222819 36.861261 36.607494 34.959214
3 12.485592 ... 23.503864 26.852796 25.304486 25.897945 25.566523
4 NaN ... 40.944660 39.162308 32.186521 30.646441 33.692734
2020 2021 2022 2023 Unnamed: 68
0 107.988156 102.322604 94.442529 85.554223 NaN
1 53.226637 51.604804 48.928047 50.066570 NaN
2 37.508189 NaN NaN NaN NaN
3 28.293487 29.258426 29.490384 NaN NaN
4 40.069948 25.862961 21.760579 25.234963 NaN
[5 rows x 69 columns]
Random Sample (5 rows):
Country Name Country Code Indicator Name \
181 OECD members OED Broad money (% of GDP)
119 Japan JPN Broad money (% of GDP)
139 Lower middle income LMC Broad money (% of GDP)
216 South Sudan SSD Broad money (% of GDP)
45 Colombia COL Broad money (% of GDP)
Indicator Code 1960 1961 1962 1963 1964 \
181 FM.LBL.BMNY.GD.ZS 54.250241 55.485249 57.362038 59.267399 59.606360
119 FM.LBL.BMNY.GD.ZS 47.337796 45.362752 64.366725 69.714004 68.331157
139 FM.LBL.BMNY.GD.ZS 22.292353 22.231069 22.532531 22.418549 21.864095
216 FM.LBL.BMNY.GD.ZS NaN NaN NaN NaN NaN
45 FM.LBL.BMNY.GD.ZS 20.436166 22.142599 23.506398 21.325430 20.368190
1965 ... 2015 2016 2017 2018 \
181 60.095144 ... 111.440285 115.539217 114.105902 113.899277
119 72.467907 ... 233.391713 239.560416 244.008404 248.102178
139 22.259931 ... 67.275361 69.416199 68.617666 66.417658
216 NaN ... 39.465805 65.459443 23.996198 23.545943
45 20.802473 ... 48.991437 48.919958 49.024116 48.016134
2019 2020 2021 2022 2023 Unnamed: 68
181 116.522288 134.422174 135.979491 123.690599 116.886399 NaN
119 252.681181 280.354948 281.945537 284.665258 275.556942 NaN
139 68.425541 77.786153 75.752666 NaN NaN NaN
216 18.734563 26.422626 34.201654 28.748201 32.530001 NaN
45 48.702760 58.278438 54.626138 48.941353 48.527707 NaN
[5 rows x 69 columns]
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 69 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country Name 266 non-null object
1 Country Code 266 non-null object
2 Indicator Name 266 non-null object
3 Indicator Code 266 non-null object
4 1960 82 non-null float64
5 1961 87 non-null float64
6 1962 100 non-null float64
7 1963 107 non-null float64
8 1964 110 non-null float64
9 1965 118 non-null float64
10 1966 125 non-null float64
11 1967 129 non-null float64
12 1968 128 non-null float64
13 1969 132 non-null float64
14 1970 133 non-null float64
15 1971 135 non-null float64
16 1972 137 non-null float64
17 1973 139 non-null float64
18 1974 140 non-null float64
19 1975 140 non-null float64
20 1976 144 non-null float64
21 1977 150 non-null float64
22 1978 151 non-null float64
23 1979 151 non-null float64
24 1980 155 non-null float64
25 1981 159 non-null float64
26 1982 161 non-null float64
27 1983 162 non-null float64
28 1984 159 non-null float64
29 1985 160 non-null float64
30 1986 166 non-null float64
31 1987 166 non-null float64
32 1988 166 non-null float64
33 1989 166 non-null float64
34 1990 162 non-null float64
35 1991 175 non-null float64
36 1992 179 non-null float64
37 1993 185 non-null float64
38 1994 186 non-null float64
39 1995 193 non-null float64
40 1996 192 non-null float64
41 1997 194 non-null float64
42 1998 197 non-null float64
43 1999 198 non-null float64
44 2000 199 non-null float64
45 2001 201 non-null float64
46 2002 203 non-null float64
47 2003 203 non-null float64
48 2004 204 non-null float64
49 2005 205 non-null float64
50 2006 205 non-null float64
51 2007 205 non-null float64
52 2008 205 non-null float64
53 2009 201 non-null float64
54 2010 202 non-null float64
55 2011 200 non-null float64
56 2012 202 non-null float64
57 2013 204 non-null float64
58 2014 202 non-null float64
59 2015 198 non-null float64
60 2016 197 non-null float64
61 2017 194 non-null float64
62 2018 190 non-null float64
63 2019 189 non-null float64
64 2020 185 non-null float64
65 2021 176 non-null float64
66 2022 163 non-null float64
67 2023 144 non-null float64
68 Unnamed: 68 0 non-null float64
dtypes: float64(65), object(4)
memory usage: 143.5+ KB
Describe:
Country Name Country Code Indicator Name Indicator Code \
count 266 266 266 266
unique 266 266 1 1
top Aruba ABW Broad money (% of GDP) FM.LBL.BMNY.GD.ZS
freq 1 1 266 266
mean NaN NaN NaN NaN
std NaN NaN NaN NaN
min NaN NaN NaN NaN
25% NaN NaN NaN NaN
50% NaN NaN NaN NaN
75% NaN NaN NaN NaN
max NaN NaN NaN NaN
1960 1961 1962 1963 1964 1965 \
count 82.000000 87.000000 100.000000 107.000000 110.000000 118.000000
unique NaN NaN NaN NaN NaN NaN
top NaN NaN NaN NaN NaN NaN
freq NaN NaN NaN NaN NaN NaN
mean 25.209203 25.299657 24.781990 25.005986 24.883755 24.931718
std 15.361464 14.935365 15.452984 16.004325 16.008286 15.372032
min 5.411826 6.273879 3.807102 4.093230 4.019443 3.656949
25% 14.800484 15.271210 14.484666 14.590600 13.844620 14.794640
50% 21.795679 22.076756 21.347714 21.038141 20.399630 21.859791
75% 28.302000 28.897332 27.294272 27.973334 28.627351 28.766586
max 90.322945 88.569555 90.875339 89.944936 90.786057 91.611927
... 2015 2016 2017 2018 2019 \
count ... 198.000000 197.000000 194.000000 190.000000 189.000000
unique ... NaN NaN NaN NaN NaN
top ... NaN NaN NaN NaN NaN
freq ... NaN NaN NaN NaN NaN
mean ... 66.891138 68.720970 67.101144 65.504612 66.950575
std ... 46.190391 47.843010 48.011439 45.545748 46.488325
min ... 0.015319 0.016039 0.015180 0.014727 0.014517
25% ... 36.980054 38.418176 37.508067 36.036647 36.767489
50% ... 55.525924 56.975874 55.573761 56.314776 58.327994
75% ... 82.581167 85.092258 82.856148 80.626471 82.140326
max ... 365.587623 375.038760 396.151789 386.098766 403.314069
2020 2021 2022 2023 Unnamed: 68
count 185.000000 176.000000 163.000000 144.000000 0.0
unique NaN NaN NaN NaN NaN
top NaN NaN NaN NaN NaN
freq NaN NaN NaN NaN NaN
mean 79.441599 77.413816 76.258102 78.991320 NaN
std 56.153572 53.881776 58.845881 57.121336 NaN
min 0.017933 0.019239 0.020397 0.019835 NaN
25% 42.754156 42.050586 40.660816 46.577363 NaN
50% 69.365472 67.231087 60.541165 65.178613 NaN
75% 94.376331 93.097793 90.278225 92.911668 NaN
max 454.703219 437.798214 461.607361 451.722287 NaN
[11 rows x 69 columns]
Key identifier 'Country Code' is present.
================================================================================
Loaded file: ./data/API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv
------------------------------------------------------------
Head:
Country Name Country Code \
0 Aruba ABW
1 Africa Eastern and Southern AFE
2 Afghanistan AFG
3 Africa Western and Central AFW
4 Angola AGO
Indicator Name Indicator Code 1960 \
0 Domestic credit provided by financial sector (... FS.AST.DOMS.GD.ZS NaN
1 Domestic credit provided by financial sector (... FS.AST.DOMS.GD.ZS NaN
2 Domestic credit provided by financial sector (... FS.AST.DOMS.GD.ZS NaN
3 Domestic credit provided by financial sector (... FS.AST.DOMS.GD.ZS NaN
4 Domestic credit provided by financial sector (... FS.AST.DOMS.GD.ZS NaN
1961 1962 1963 1964 1965 ... 2015 2016 2017 \
0 NaN NaN NaN NaN NaN ... NaN NaN NaN
1 NaN NaN NaN NaN NaN ... NaN NaN NaN
2 NaN NaN NaN NaN NaN ... NaN NaN NaN
3 NaN NaN NaN NaN NaN ... NaN NaN NaN
4 NaN NaN NaN NaN NaN ... 27.050924 27.543143 29.796248
2018 2019 2020 2021 2022 2023 \
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 25.286429 24.502063 34.232721 21.793201 18.862576 20.712318
Unnamed: 68
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
[5 rows x 69 columns]
Random Sample (5 rows):
Country Name Country Code \
181 OECD members OED
119 Japan JPN
139 Lower middle income LMC
216 South Sudan SSD
45 Colombia COL
Indicator Name Indicator Code \
181 Domestic credit provided by financial sector (... FS.AST.DOMS.GD.ZS
119 Domestic credit provided by financial sector (... FS.AST.DOMS.GD.ZS
139 Domestic credit provided by financial sector (... FS.AST.DOMS.GD.ZS
216 Domestic credit provided by financial sector (... FS.AST.DOMS.GD.ZS
45 Domestic credit provided by financial sector (... FS.AST.DOMS.GD.ZS
1960 1961 1962 1963 1964 1965 ... 2015 2016 \
181 NaN NaN NaN NaN NaN NaN ... NaN NaN
119 NaN NaN NaN NaN NaN NaN ... 341.142456 346.832509
139 NaN NaN NaN NaN NaN NaN ... NaN NaN
216 NaN NaN NaN NaN NaN NaN ... NaN NaN
45 NaN NaN NaN NaN NaN NaN ... NaN NaN
2017 2018 2019 2020 2021 2022 \
181 NaN NaN NaN NaN NaN NaN
119 352.149391 350.615463 363.317971 393.451267 391.251954 385.902095
139 NaN NaN NaN NaN NaN NaN
216 NaN NaN NaN NaN NaN NaN
45 NaN NaN NaN NaN NaN NaN
2023 Unnamed: 68
181 NaN NaN
119 380.229411 NaN
139 NaN NaN
216 NaN NaN
45 NaN NaN
[5 rows x 69 columns]
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 69 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country Name 266 non-null object
1 Country Code 266 non-null object
2 Indicator Name 266 non-null object
3 Indicator Code 266 non-null object
4 1960 4 non-null float64
5 1961 4 non-null float64
6 1962 4 non-null float64
7 1963 5 non-null float64
8 1964 5 non-null float64
9 1965 7 non-null float64
10 1966 7 non-null float64
11 1967 7 non-null float64
12 1968 7 non-null float64
13 1969 7 non-null float64
14 1970 9 non-null float64
15 1971 9 non-null float64
16 1972 9 non-null float64
17 1973 10 non-null float64
18 1974 11 non-null float64
19 1975 12 non-null float64
20 1976 12 non-null float64
21 1977 12 non-null float64
22 1978 13 non-null float64
23 1979 14 non-null float64
24 1980 15 non-null float64
25 1981 15 non-null float64
26 1982 16 non-null float64
27 1983 16 non-null float64
28 1984 16 non-null float64
29 1985 16 non-null float64
30 1986 17 non-null float64
31 1987 16 non-null float64
32 1988 17 non-null float64
33 1989 17 non-null float64
34 1990 16 non-null float64
35 1991 16 non-null float64
36 1992 19 non-null float64
37 1993 19 non-null float64
38 1994 19 non-null float64
39 1995 18 non-null float64
40 1996 18 non-null float64
41 1997 19 non-null float64
42 1998 18 non-null float64
43 1999 18 non-null float64
44 2000 18 non-null float64
45 2001 28 non-null float64
46 2002 29 non-null float64
47 2003 29 non-null float64
48 2004 31 non-null float64
49 2005 32 non-null float64
50 2006 33 non-null float64
51 2007 35 non-null float64
52 2008 40 non-null float64
53 2009 43 non-null float64
54 2010 45 non-null float64
55 2011 46 non-null float64
56 2012 46 non-null float64
57 2013 46 non-null float64
58 2014 47 non-null float64
59 2015 49 non-null float64
60 2016 51 non-null float64
61 2017 53 non-null float64
62 2018 51 non-null float64
63 2019 52 non-null float64
64 2020 51 non-null float64
65 2021 50 non-null float64
66 2022 50 non-null float64
67 2023 46 non-null float64
68 Unnamed: 68 0 non-null float64
dtypes: float64(65), object(4)
memory usage: 143.5+ KB
Describe:
Country Name Country Code \
count 266 266
unique 266 266
top Aruba ABW
freq 1 1
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
Indicator Name Indicator Code \
count 266 266
unique 1 1
top Domestic credit provided by financial sector (... FS.AST.DOMS.GD.ZS
freq 266 266
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
1960 1961 1962 1963 1964 \
count 4.000000 4.000000 4.000000 5.000000 5.000000
unique NaN NaN NaN NaN NaN
top NaN NaN NaN NaN NaN
freq NaN NaN NaN NaN NaN
mean 60.497362 62.521741 61.944929 56.113973 56.905996
std 47.176389 51.034667 51.465583 49.792859 50.922520
min 17.660525 17.886843 17.265977 17.926408 17.442244
25% 20.668782 18.544845 17.428751 17.950779 19.012499
50% 61.500114 62.741075 61.999187 23.486982 22.778517
75% 101.328694 106.717971 106.515365 110.602849 112.648361
max 101.328694 106.717971 106.515365 110.602849 112.648361
1965 ... 2015 2016 2017 2018 \
count 7.000000 ... 49.000000 51.000000 53.000000 51.000000
unique NaN ... NaN NaN NaN NaN
top NaN ... NaN NaN NaN NaN
freq NaN ... NaN NaN NaN NaN
mean 52.963125 ... 79.208145 80.798387 80.434390 80.375325
std 47.494994 ... 61.696477 62.034122 62.782078 62.408452
min 4.845929 ... 19.729625 20.718422 13.931641 13.132823
25% 18.456752 ... 45.692450 44.067888 41.731155 44.625636
50% 24.809967 ... 60.970830 62.110829 63.610667 63.069215
75% 94.937869 ... 85.193303 90.067964 92.534391 87.350251
max 114.296734 ... 341.142456 346.832509 352.149391 350.615463
2019 2020 2021 2022 2023 \
count 52.000000 51.000000 50.000000 50.000000 46.000000
unique NaN NaN NaN NaN NaN
top NaN NaN NaN NaN NaN
freq NaN NaN NaN NaN NaN
mean 81.529471 93.489897 89.915029 85.519631 88.251415
std 65.708797 73.317691 75.845183 70.122398 71.235186
min 6.612871 10.500594 10.373594 8.847306 11.307526
25% 43.066322 49.293123 41.460091 39.196702 42.785166
50% 62.981791 72.797877 66.033797 62.362141 64.390229
75% 93.744828 116.971561 103.710998 106.026011 117.458229
max 363.317971 393.451267 391.251954 385.902095 380.229411
Unnamed: 68
count 0.0
unique NaN
top NaN
freq NaN
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
[11 rows x 69 columns]
Key identifier 'Country Code' is present.
================================================================================
Loaded file: ./data/API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv
------------------------------------------------------------
Head:
Country Name Country Code \
0 Aruba ABW
1 Africa Eastern and Southern AFE
2 Afghanistan AFG
3 Africa Western and Central AFW
4 Angola AGO
Indicator Name Indicator Code 1960 \
0 Domestic credit to private sector (% of GDP) FS.AST.PRVT.GD.ZS NaN
1 Domestic credit to private sector (% of GDP) FS.AST.PRVT.GD.ZS NaN
2 Domestic credit to private sector (% of GDP) FS.AST.PRVT.GD.ZS NaN
3 Domestic credit to private sector (% of GDP) FS.AST.PRVT.GD.ZS 5.709638
4 Domestic credit to private sector (% of GDP) FS.AST.PRVT.GD.ZS NaN
1961 1962 1963 1964 1965 ... 2015 \
0 NaN NaN NaN NaN NaN ... 56.626871
1 NaN NaN NaN NaN 30.719143 ... 66.931181
2 NaN NaN NaN NaN NaN ... 3.981494
3 6.073112 7.644585 8.603106 8.710085 8.732643 ... 14.528591
4 NaN NaN NaN NaN NaN ... 25.240108
2016 2017 2018 2019 2020 2021 \
0 57.221975 57.246514 55.965159 58.785552 81.748704 67.508471
1 66.814006 66.527928 63.325700 62.015537 57.378682 51.662134
2 3.888322 3.841881 3.512317 3.229957 3.096598 NaN
3 15.958738 15.039485 13.390122 13.729785 14.513648 15.504186
4 21.098410 17.000685 14.935270 15.340922 13.724622 10.410725
2022 2023 Unnamed: 68
0 61.356554 59.605470 NaN
1 46.663674 49.050146 NaN
2 NaN NaN NaN
3 15.430622 NaN NaN
4 8.628867 9.533978 NaN
[5 rows x 69 columns]
Random Sample (5 rows):
Country Name Country Code \
181 OECD members OED
119 Japan JPN
139 Lower middle income LMC
216 South Sudan SSD
45 Colombia COL
Indicator Name Indicator Code \
181 Domestic credit to private sector (% of GDP) FS.AST.PRVT.GD.ZS
119 Domestic credit to private sector (% of GDP) FS.AST.PRVT.GD.ZS
139 Domestic credit to private sector (% of GDP) FS.AST.PRVT.GD.ZS
216 Domestic credit to private sector (% of GDP) FS.AST.PRVT.GD.ZS
45 Domestic credit to private sector (% of GDP) FS.AST.PRVT.GD.ZS
1960 1961 1962 1963 1964 1965 ... \
181 59.580967 62.734946 NaN NaN NaN NaN ...
119 NaN NaN NaN NaN NaN NaN ...
139 8.636683 9.192425 9.668227 10.316468 10.272776 10.764414 ...
216 NaN NaN NaN NaN NaN NaN ...
45 22.948625 24.272706 24.328054 22.591354 20.516999 20.642111 ...
2015 2016 2017 2018 2019 2020 \
181 138.898742 141.095729 143.684590 138.587816 145.311010 161.933459
119 163.044821 163.917125 169.121093 168.827108 177.368855 195.473559
139 42.070213 43.118988 43.679190 43.673365 44.248514 47.772932
216 2.304606 4.424749 1.705071 1.526810 1.703505 1.925005
45 46.905394 47.054198 49.805421 45.870707 46.041578 51.998779
2021 2022 2023 Unnamed: 68
181 159.618777 144.532696 148.070632 NaN
119 195.783803 196.831842 195.679676 NaN
139 46.547109 NaN NaN NaN
216 2.750154 3.155151 5.386450 NaN
45 47.912406 44.227036 41.978583 NaN
[5 rows x 69 columns]
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 69 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country Name 266 non-null object
1 Country Code 266 non-null object
2 Indicator Name 266 non-null object
3 Indicator Code 266 non-null object
4 1960 63 non-null float64
5 1961 64 non-null float64
6 1962 78 non-null float64
7 1963 82 non-null float64
8 1964 84 non-null float64
9 1965 90 non-null float64
10 1966 92 non-null float64
11 1967 92 non-null float64
12 1968 92 non-null float64
13 1969 92 non-null float64
14 1970 97 non-null float64
15 1971 98 non-null float64
16 1972 102 non-null float64
17 1973 105 non-null float64
18 1974 106 non-null float64
19 1975 107 non-null float64
20 1976 108 non-null float64
21 1977 122 non-null float64
22 1978 116 non-null float64
23 1979 122 non-null float64
24 1980 124 non-null float64
25 1981 128 non-null float64
26 1982 132 non-null float64
27 1983 133 non-null float64
28 1984 133 non-null float64
29 1985 135 non-null float64
30 1986 129 non-null float64
31 1987 126 non-null float64
32 1988 133 non-null float64
33 1989 132 non-null float64
34 1990 125 non-null float64
35 1991 130 non-null float64
36 1992 137 non-null float64
37 1993 142 non-null float64
38 1994 142 non-null float64
39 1995 153 non-null float64
40 1996 154 non-null float64
41 1997 158 non-null float64
42 1998 159 non-null float64
43 1999 158 non-null float64
44 2000 160 non-null float64
45 2001 186 non-null float64
46 2002 190 non-null float64
47 2003 190 non-null float64
48 2004 195 non-null float64
49 2005 198 non-null float64
50 2006 201 non-null float64
51 2007 204 non-null float64
52 2008 211 non-null float64
53 2009 214 non-null float64
54 2010 218 non-null float64
55 2011 220 non-null float64
56 2012 222 non-null float64
57 2013 223 non-null float64
58 2014 222 non-null float64
59 2015 222 non-null float64
60 2016 223 non-null float64
61 2017 217 non-null float64
62 2018 211 non-null float64
63 2019 213 non-null float64
64 2020 209 non-null float64
65 2021 205 non-null float64
66 2022 194 non-null float64
67 2023 173 non-null float64
68 Unnamed: 68 0 non-null float64
dtypes: float64(65), object(4)
memory usage: 143.5+ KB
Describe:
Country Name Country Code \
count 266 266
unique 266 266
top Aruba ABW
freq 1 1
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
Indicator Name Indicator Code \
count 266 266
unique 1 1
top Domestic credit to private sector (% of GDP) FS.AST.PRVT.GD.ZS
freq 266 266
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
1960 1961 1962 1963 1964 1965 ... \
count 63.000000 64.000000 78.000000 82.000000 84.000000 90.000000 ...
unique NaN NaN NaN NaN NaN NaN ...
top NaN NaN NaN NaN NaN NaN ...
freq NaN NaN NaN NaN NaN NaN ...
mean 18.132931 18.416991 16.379756 15.334164 15.333636 16.569537 ...
std 18.967027 19.324236 17.108516 15.653692 15.665213 16.251697 ...
min 0.544827 0.466949 0.453731 1.234835 0.938462 1.102151 ...
25% 7.702668 8.289130 7.647343 7.024061 7.303075 7.777960 ...
50% 11.022740 11.939333 12.073950 11.746205 11.161848 13.111879 ...
75% 18.806916 17.783034 16.538506 16.287199 16.324144 18.374659 ...
max 87.788473 88.665218 89.201134 89.098030 87.074196 88.176344 ...
2015 2016 2017 2018 2019 \
count 222.000000 223.000000 217.000000 211.000000 213.000000
unique NaN NaN NaN NaN NaN
top NaN NaN NaN NaN NaN
freq NaN NaN NaN NaN NaN
mean 58.257364 58.694564 57.721297 57.023178 57.550422
std 43.541823 43.527507 43.634690 42.977755 44.397592
min 0.003416 0.003542 0.003288 0.003641 0.003860
25% 25.683754 25.826029 23.926300 23.509202 23.300192
50% 48.470198 47.307130 45.769454 45.884530 46.448204
75% 77.948987 79.995269 80.194048 76.386587 78.158986
max 243.416337 216.242757 223.371866 219.908123 236.740072
2020 2021 2022 2023 Unnamed: 68
count 209.000000 205.000000 194.000000 173.000000 0.0
unique NaN NaN NaN NaN NaN
top NaN NaN NaN NaN NaN
freq NaN NaN NaN NaN NaN
mean 64.376563 61.717185 60.776676 62.742298 NaN
std 51.082897 49.806702 50.547629 49.508267 NaN
min 0.003620 0.004227 0.003539 0.003238 NaN
25% 26.032015 26.349745 25.206677 28.923168 NaN
50% 51.998779 47.776324 46.126711 48.534422 NaN
75% 88.022108 82.836098 84.188395 82.009431 NaN
max 269.161635 259.183988 286.557423 248.758613 NaN
[11 rows x 69 columns]
Key identifier 'Country Code' is present.
================================================================================
# Load Country-Level Meta Data files (no skiprows)
for f, d in load_and_inspect(country_meta_files, "Country-Level Meta Data", skip_rows=0):
country_meta_data[f] = d
==== Loading Country-Level Meta Data Files ====
Loaded file: ./data/Metadata_Country_API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv
------------------------------------------------------------
Head:
"Country Code" Region IncomeGroup \
0 ABW Latin America & Caribbean High income
1 AFE NaN NaN
2 AFG South Asia Low income
3 AFW NaN NaN
4 AGO Sub-Saharan Africa Lower middle income
SpecialNotes \
0 NaN
1 26 countries, stretching from the Red Sea in t...
2 The reporting period for national accounts dat...
3 22 countries, stretching from the westernmost ...
4 The World Bank systematically assesses the app...
TableName Unnamed: 5
0 Aruba NaN
1 Africa Eastern and Southern NaN
2 Afghanistan NaN
3 Africa Western and Central NaN
4 Angola NaN
Random Sample (5 rows):
"Country Code" Region IncomeGroup \
179 NZL East Asia & Pacific High income
115 ITA Europe & Central Asia High income
96 HKG East Asia & Pacific High income
24 BIH Europe & Central Asia Upper middle income
9 ARG Latin America & Caribbean Upper middle income
SpecialNotes \
179 Fiscal year end: March 31; reporting period fo...
115 A simple multiplier is used to convert the nat...
96 On 1 July 1997 China resumed its exercise of s...
24 NaN
9 The World Bank systematically assesses the app...
TableName Unnamed: 5
179 New Zealand NaN
115 Italy NaN
96 Hong Kong SAR, China NaN
24 Bosnia and Herzegovina NaN
9 Argentina NaN
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 265 entries, 0 to 264
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 "Country Code" 265 non-null object
1 Region 217 non-null object
2 IncomeGroup 216 non-null object
3 SpecialNotes 127 non-null object
4 TableName 265 non-null object
5 Unnamed: 5 0 non-null float64
dtypes: float64(1), object(5)
memory usage: 12.6+ KB
Describe:
"Country Code" Region IncomeGroup \
count 265 217 216
unique 265 7 4
top ABW Europe & Central Asia High income
freq 1 58 85
mean NaN NaN NaN
std NaN NaN NaN
min NaN NaN NaN
25% NaN NaN NaN
50% NaN NaN NaN
75% NaN NaN NaN
max NaN NaN NaN
SpecialNotes TableName \
count 127 265
unique 113 265
top Fiscal year end: March 31; reporting period fo... Aruba
freq 7 1
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
Unnamed: 5
count 0.0
unique NaN
top NaN
freq NaN
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Warning: No key identifier found in this file.
================================================================================
Loaded file: ./data/Metadata_Country_API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv
------------------------------------------------------------
Head:
"Country Code" Region IncomeGroup \
0 ABW Latin America & Caribbean High income
1 AFE NaN NaN
2 AFG South Asia Low income
3 AFW NaN NaN
4 AGO Sub-Saharan Africa Lower middle income
SpecialNotes \
0 NaN
1 26 countries, stretching from the Red Sea in t...
2 The reporting period for national accounts dat...
3 22 countries, stretching from the westernmost ...
4 The World Bank systematically assesses the app...
TableName Unnamed: 5
0 Aruba NaN
1 Africa Eastern and Southern NaN
2 Afghanistan NaN
3 Africa Western and Central NaN
4 Angola NaN
Random Sample (5 rows):
"Country Code" Region IncomeGroup \
179 NZL East Asia & Pacific High income
115 ITA Europe & Central Asia High income
96 HKG East Asia & Pacific High income
24 BIH Europe & Central Asia Upper middle income
9 ARG Latin America & Caribbean Upper middle income
SpecialNotes \
179 Fiscal year end: March 31; reporting period fo...
115 A simple multiplier is used to convert the nat...
96 On 1 July 1997 China resumed its exercise of s...
24 NaN
9 The World Bank systematically assesses the app...
TableName Unnamed: 5
179 New Zealand NaN
115 Italy NaN
96 Hong Kong SAR, China NaN
24 Bosnia and Herzegovina NaN
9 Argentina NaN
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 265 entries, 0 to 264
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 "Country Code" 265 non-null object
1 Region 217 non-null object
2 IncomeGroup 216 non-null object
3 SpecialNotes 127 non-null object
4 TableName 265 non-null object
5 Unnamed: 5 0 non-null float64
dtypes: float64(1), object(5)
memory usage: 12.6+ KB
Describe:
"Country Code" Region IncomeGroup \
count 265 217 216
unique 265 7 4
top ABW Europe & Central Asia High income
freq 1 58 85
mean NaN NaN NaN
std NaN NaN NaN
min NaN NaN NaN
25% NaN NaN NaN
50% NaN NaN NaN
75% NaN NaN NaN
max NaN NaN NaN
SpecialNotes TableName \
count 127 265
unique 113 265
top Fiscal year end: March 31; reporting period fo... Aruba
freq 7 1
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
Unnamed: 5
count 0.0
unique NaN
top NaN
freq NaN
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Warning: No key identifier found in this file.
================================================================================
Loaded file: ./data/Metadata_Country_API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv
------------------------------------------------------------
Head:
"Country Code" Region IncomeGroup \
0 ABW Latin America & Caribbean High income
1 AFE NaN NaN
2 AFG South Asia Low income
3 AFW NaN NaN
4 AGO Sub-Saharan Africa Lower middle income
SpecialNotes \
0 NaN
1 26 countries, stretching from the Red Sea in t...
2 The reporting period for national accounts dat...
3 22 countries, stretching from the westernmost ...
4 The World Bank systematically assesses the app...
TableName Unnamed: 5
0 Aruba NaN
1 Africa Eastern and Southern NaN
2 Afghanistan NaN
3 Africa Western and Central NaN
4 Angola NaN
Random Sample (5 rows):
"Country Code" Region IncomeGroup \
179 NZL East Asia & Pacific High income
115 ITA Europe & Central Asia High income
96 HKG East Asia & Pacific High income
24 BIH Europe & Central Asia Upper middle income
9 ARG Latin America & Caribbean Upper middle income
SpecialNotes \
179 Fiscal year end: March 31; reporting period fo...
115 A simple multiplier is used to convert the nat...
96 On 1 July 1997 China resumed its exercise of s...
24 NaN
9 The World Bank systematically assesses the app...
TableName Unnamed: 5
179 New Zealand NaN
115 Italy NaN
96 Hong Kong SAR, China NaN
24 Bosnia and Herzegovina NaN
9 Argentina NaN
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 265 entries, 0 to 264
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 "Country Code" 265 non-null object
1 Region 217 non-null object
2 IncomeGroup 216 non-null object
3 SpecialNotes 127 non-null object
4 TableName 265 non-null object
5 Unnamed: 5 0 non-null float64
dtypes: float64(1), object(5)
memory usage: 12.6+ KB
Describe:
"Country Code" Region IncomeGroup \
count 265 217 216
unique 265 7 4
top ABW Europe & Central Asia High income
freq 1 58 85
mean NaN NaN NaN
std NaN NaN NaN
min NaN NaN NaN
25% NaN NaN NaN
50% NaN NaN NaN
75% NaN NaN NaN
max NaN NaN NaN
SpecialNotes TableName \
count 127 265
unique 113 265
top Fiscal year end: March 31; reporting period fo... Aruba
freq 7 1
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
Unnamed: 5
count 0.0
unique NaN
top NaN
freq NaN
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Warning: No key identifier found in this file.
================================================================================
Loaded file: ./data/Metadata_Country_API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv
------------------------------------------------------------
Head:
"Country Code" Region IncomeGroup \
0 ABW Latin America & Caribbean High income
1 AFE NaN NaN
2 AFG South Asia Low income
3 AFW NaN NaN
4 AGO Sub-Saharan Africa Lower middle income
SpecialNotes \
0 NaN
1 26 countries, stretching from the Red Sea in t...
2 The reporting period for national accounts dat...
3 22 countries, stretching from the westernmost ...
4 The World Bank systematically assesses the app...
TableName Unnamed: 5
0 Aruba NaN
1 Africa Eastern and Southern NaN
2 Afghanistan NaN
3 Africa Western and Central NaN
4 Angola NaN
Random Sample (5 rows):
"Country Code" Region IncomeGroup \
179 NZL East Asia & Pacific High income
115 ITA Europe & Central Asia High income
96 HKG East Asia & Pacific High income
24 BIH Europe & Central Asia Upper middle income
9 ARG Latin America & Caribbean Upper middle income
SpecialNotes \
179 Fiscal year end: March 31; reporting period fo...
115 A simple multiplier is used to convert the nat...
96 On 1 July 1997 China resumed its exercise of s...
24 NaN
9 The World Bank systematically assesses the app...
TableName Unnamed: 5
179 New Zealand NaN
115 Italy NaN
96 Hong Kong SAR, China NaN
24 Bosnia and Herzegovina NaN
9 Argentina NaN
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 265 entries, 0 to 264
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 "Country Code" 265 non-null object
1 Region 217 non-null object
2 IncomeGroup 216 non-null object
3 SpecialNotes 127 non-null object
4 TableName 265 non-null object
5 Unnamed: 5 0 non-null float64
dtypes: float64(1), object(5)
memory usage: 12.6+ KB
Describe:
"Country Code" Region IncomeGroup \
count 265 217 216
unique 265 7 4
top ABW Europe & Central Asia High income
freq 1 58 85
mean NaN NaN NaN
std NaN NaN NaN
min NaN NaN NaN
25% NaN NaN NaN
50% NaN NaN NaN
75% NaN NaN NaN
max NaN NaN NaN
SpecialNotes TableName \
count 127 265
unique 113 265
top Fiscal year end: March 31; reporting period fo... Aruba
freq 7 1
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
Unnamed: 5
count 0.0
unique NaN
top NaN
freq NaN
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Warning: No key identifier found in this file.
================================================================================
Loaded file: ./data/Metadata_Country_API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv
------------------------------------------------------------
Head:
"Country Code" Region IncomeGroup \
0 ABW Latin America & Caribbean High income
1 AFE NaN NaN
2 AFG South Asia Low income
3 AFW NaN NaN
4 AGO Sub-Saharan Africa Lower middle income
SpecialNotes \
0 NaN
1 26 countries, stretching from the Red Sea in t...
2 The reporting period for national accounts dat...
3 22 countries, stretching from the westernmost ...
4 The World Bank systematically assesses the app...
TableName Unnamed: 5
0 Aruba NaN
1 Africa Eastern and Southern NaN
2 Afghanistan NaN
3 Africa Western and Central NaN
4 Angola NaN
Random Sample (5 rows):
"Country Code" Region IncomeGroup \
179 NZL East Asia & Pacific High income
115 ITA Europe & Central Asia High income
96 HKG East Asia & Pacific High income
24 BIH Europe & Central Asia Upper middle income
9 ARG Latin America & Caribbean Upper middle income
SpecialNotes \
179 Fiscal year end: March 31; reporting period fo...
115 A simple multiplier is used to convert the nat...
96 On 1 July 1997 China resumed its exercise of s...
24 NaN
9 The World Bank systematically assesses the app...
TableName Unnamed: 5
179 New Zealand NaN
115 Italy NaN
96 Hong Kong SAR, China NaN
24 Bosnia and Herzegovina NaN
9 Argentina NaN
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 265 entries, 0 to 264
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 "Country Code" 265 non-null object
1 Region 217 non-null object
2 IncomeGroup 216 non-null object
3 SpecialNotes 127 non-null object
4 TableName 265 non-null object
5 Unnamed: 5 0 non-null float64
dtypes: float64(1), object(5)
memory usage: 12.6+ KB
Describe:
"Country Code" Region IncomeGroup \
count 265 217 216
unique 265 7 4
top ABW Europe & Central Asia High income
freq 1 58 85
mean NaN NaN NaN
std NaN NaN NaN
min NaN NaN NaN
25% NaN NaN NaN
50% NaN NaN NaN
75% NaN NaN NaN
max NaN NaN NaN
SpecialNotes TableName \
count 127 265
unique 113 265
top Fiscal year end: March 31; reporting period fo... Aruba
freq 7 1
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
Unnamed: 5
count 0.0
unique NaN
top NaN
freq NaN
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Warning: No key identifier found in this file.
================================================================================
Loaded file: ./data/Metadata_Country_API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv
------------------------------------------------------------
Head:
"Country Code" Region IncomeGroup \
0 ABW Latin America & Caribbean High income
1 AFE NaN NaN
2 AFG South Asia Low income
3 AFW NaN NaN
4 AGO Sub-Saharan Africa Lower middle income
SpecialNotes \
0 NaN
1 26 countries, stretching from the Red Sea in t...
2 The reporting period for national accounts dat...
3 22 countries, stretching from the westernmost ...
4 The World Bank systematically assesses the app...
TableName Unnamed: 5
0 Aruba NaN
1 Africa Eastern and Southern NaN
2 Afghanistan NaN
3 Africa Western and Central NaN
4 Angola NaN
Random Sample (5 rows):
"Country Code" Region IncomeGroup \
179 NZL East Asia & Pacific High income
115 ITA Europe & Central Asia High income
96 HKG East Asia & Pacific High income
24 BIH Europe & Central Asia Upper middle income
9 ARG Latin America & Caribbean Upper middle income
SpecialNotes \
179 Fiscal year end: March 31; reporting period fo...
115 A simple multiplier is used to convert the nat...
96 On 1 July 1997 China resumed its exercise of s...
24 NaN
9 The World Bank systematically assesses the app...
TableName Unnamed: 5
179 New Zealand NaN
115 Italy NaN
96 Hong Kong SAR, China NaN
24 Bosnia and Herzegovina NaN
9 Argentina NaN
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 265 entries, 0 to 264
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 "Country Code" 265 non-null object
1 Region 217 non-null object
2 IncomeGroup 216 non-null object
3 SpecialNotes 127 non-null object
4 TableName 265 non-null object
5 Unnamed: 5 0 non-null float64
dtypes: float64(1), object(5)
memory usage: 12.6+ KB
Describe:
"Country Code" Region IncomeGroup \
count 265 217 216
unique 265 7 4
top ABW Europe & Central Asia High income
freq 1 58 85
mean NaN NaN NaN
std NaN NaN NaN
min NaN NaN NaN
25% NaN NaN NaN
50% NaN NaN NaN
75% NaN NaN NaN
max NaN NaN NaN
SpecialNotes TableName \
count 127 265
unique 113 265
top Fiscal year end: March 31; reporting period fo... Aruba
freq 7 1
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
Unnamed: 5
count 0.0
unique NaN
top NaN
freq NaN
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Warning: No key identifier found in this file.
================================================================================
### Load Indicator-Level Meta Data files (no skiprows)
# Load Indicator-Level Meta Data files (no skiprows)
for f, d in load_and_inspect(indicator_meta_files, "Indicator-Level Meta Data", skip_rows=0):
indicator_meta_data[f] = d
==== Loading Indicator-Level Meta Data Files ====
Loaded file: ./data/Metadata_Indicator_API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv
------------------------------------------------------------
Head:
"INDICATOR_CODE" INDICATOR_NAME \
0 BX.KLT.DINV.WD.GD.ZS Foreign direct investment, net inflows (% of GDP)
SOURCE_NOTE \
0 Foreign direct investment are the net inflows ...
SOURCE_ORGANIZATION Unnamed: 4
0 International Monetary Fund, International Fin... NaN
Random Sample (1 rows):
"INDICATOR_CODE" INDICATOR_NAME \
0 BX.KLT.DINV.WD.GD.ZS Foreign direct investment, net inflows (% of GDP)
SOURCE_NOTE \
0 Foreign direct investment are the net inflows ...
SOURCE_ORGANIZATION Unnamed: 4
0 International Monetary Fund, International Fin... NaN
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 "INDICATOR_CODE" 1 non-null object
1 INDICATOR_NAME 1 non-null object
2 SOURCE_NOTE 1 non-null object
3 SOURCE_ORGANIZATION 1 non-null object
4 Unnamed: 4 0 non-null float64
dtypes: float64(1), object(4)
memory usage: 172.0+ bytes
Describe:
"INDICATOR_CODE" \
count 1
unique 1
top BX.KLT.DINV.WD.GD.ZS
freq 1
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
INDICATOR_NAME \
count 1
unique 1
top Foreign direct investment, net inflows (% of GDP)
freq 1
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
SOURCE_NOTE \
count 1
unique 1
top Foreign direct investment are the net inflows ...
freq 1
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
SOURCE_ORGANIZATION Unnamed: 4
count 1 0.0
unique 1 NaN
top International Monetary Fund, International Fin... NaN
freq 1 NaN
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
Warning: No key identifier found in this file.
================================================================================
Loaded file: ./data/Metadata_Indicator_API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv
------------------------------------------------------------
Head:
"INDICATOR_CODE" INDICATOR_NAME \
0 FB.AST.NPER.ZS Bank nonperforming loans to total gross loans (%)
SOURCE_NOTE \
0 Bank nonperforming loans to total gross loans ...
SOURCE_ORGANIZATION Unnamed: 4
0 International Monetary Fund, Financial Soundne... NaN
Random Sample (1 rows):
"INDICATOR_CODE" INDICATOR_NAME \
0 FB.AST.NPER.ZS Bank nonperforming loans to total gross loans (%)
SOURCE_NOTE \
0 Bank nonperforming loans to total gross loans ...
SOURCE_ORGANIZATION Unnamed: 4
0 International Monetary Fund, Financial Soundne... NaN
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 "INDICATOR_CODE" 1 non-null object
1 INDICATOR_NAME 1 non-null object
2 SOURCE_NOTE 1 non-null object
3 SOURCE_ORGANIZATION 1 non-null object
4 Unnamed: 4 0 non-null float64
dtypes: float64(1), object(4)
memory usage: 172.0+ bytes
Describe:
"INDICATOR_CODE" INDICATOR_NAME \
count 1 1
unique 1 1
top FB.AST.NPER.ZS Bank nonperforming loans to total gross loans (%)
freq 1 1
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
SOURCE_NOTE \
count 1
unique 1
top Bank nonperforming loans to total gross loans ...
freq 1
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
SOURCE_ORGANIZATION Unnamed: 4
count 1 0.0
unique 1 NaN
top International Monetary Fund, Financial Soundne... NaN
freq 1 NaN
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
Warning: No key identifier found in this file.
================================================================================
Loaded file: ./data/Metadata_Indicator_API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv
------------------------------------------------------------
Head:
"INDICATOR_CODE" INDICATOR_NAME \
0 FD.AST.PRVT.GD.ZS Domestic credit to private sector by banks (% ...
SOURCE_NOTE \
0 Domestic credit to private sector by banks ref...
SOURCE_ORGANIZATION Unnamed: 4
0 International Monetary Fund, International Fin... NaN
Random Sample (1 rows):
"INDICATOR_CODE" INDICATOR_NAME \
0 FD.AST.PRVT.GD.ZS Domestic credit to private sector by banks (% ...
SOURCE_NOTE \
0 Domestic credit to private sector by banks ref...
SOURCE_ORGANIZATION Unnamed: 4
0 International Monetary Fund, International Fin... NaN
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 "INDICATOR_CODE" 1 non-null object
1 INDICATOR_NAME 1 non-null object
2 SOURCE_NOTE 1 non-null object
3 SOURCE_ORGANIZATION 1 non-null object
4 Unnamed: 4 0 non-null float64
dtypes: float64(1), object(4)
memory usage: 172.0+ bytes
Describe:
"INDICATOR_CODE" INDICATOR_NAME \
count 1 1
unique 1 1
top FD.AST.PRVT.GD.ZS Domestic credit to private sector by banks (% ...
freq 1 1
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
SOURCE_NOTE \
count 1
unique 1
top Domestic credit to private sector by banks ref...
freq 1
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
SOURCE_ORGANIZATION Unnamed: 4
count 1 0.0
unique 1 NaN
top International Monetary Fund, International Fin... NaN
freq 1 NaN
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
Warning: No key identifier found in this file.
================================================================================
Loaded file: ./data/Metadata_Indicator_API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv
------------------------------------------------------------
Head:
"INDICATOR_CODE" INDICATOR_NAME \
0 FM.LBL.BMNY.GD.ZS Broad money (% of GDP)
SOURCE_NOTE \
0 Broad money (IFS line 35L..ZK) is the sum of c...
SOURCE_ORGANIZATION Unnamed: 4
0 International Monetary Fund, International Fin... NaN
Random Sample (1 rows):
"INDICATOR_CODE" INDICATOR_NAME \
0 FM.LBL.BMNY.GD.ZS Broad money (% of GDP)
SOURCE_NOTE \
0 Broad money (IFS line 35L..ZK) is the sum of c...
SOURCE_ORGANIZATION Unnamed: 4
0 International Monetary Fund, International Fin... NaN
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 "INDICATOR_CODE" 1 non-null object
1 INDICATOR_NAME 1 non-null object
2 SOURCE_NOTE 1 non-null object
3 SOURCE_ORGANIZATION 1 non-null object
4 Unnamed: 4 0 non-null float64
dtypes: float64(1), object(4)
memory usage: 172.0+ bytes
Describe:
"INDICATOR_CODE" INDICATOR_NAME \
count 1 1
unique 1 1
top FM.LBL.BMNY.GD.ZS Broad money (% of GDP)
freq 1 1
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
SOURCE_NOTE \
count 1
unique 1
top Broad money (IFS line 35L..ZK) is the sum of c...
freq 1
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
SOURCE_ORGANIZATION Unnamed: 4
count 1 0.0
unique 1 NaN
top International Monetary Fund, International Fin... NaN
freq 1 NaN
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
Warning: No key identifier found in this file.
================================================================================
Loaded file: ./data/Metadata_Indicator_API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv
------------------------------------------------------------
Head:
"INDICATOR_CODE" INDICATOR_NAME \
0 FS.AST.DOMS.GD.ZS Domestic credit provided by financial sector (...
SOURCE_NOTE \
0 Domestic credit provided by the financial sect...
SOURCE_ORGANIZATION Unnamed: 4
0 International Monetary Fund, International Fin... NaN
Random Sample (1 rows):
"INDICATOR_CODE" INDICATOR_NAME \
0 FS.AST.DOMS.GD.ZS Domestic credit provided by financial sector (...
SOURCE_NOTE \
0 Domestic credit provided by the financial sect...
SOURCE_ORGANIZATION Unnamed: 4
0 International Monetary Fund, International Fin... NaN
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 "INDICATOR_CODE" 1 non-null object
1 INDICATOR_NAME 1 non-null object
2 SOURCE_NOTE 1 non-null object
3 SOURCE_ORGANIZATION 1 non-null object
4 Unnamed: 4 0 non-null float64
dtypes: float64(1), object(4)
memory usage: 172.0+ bytes
Describe:
"INDICATOR_CODE" INDICATOR_NAME \
count 1 1
unique 1 1
top FS.AST.DOMS.GD.ZS Domestic credit provided by financial sector (...
freq 1 1
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
SOURCE_NOTE \
count 1
unique 1
top Domestic credit provided by the financial sect...
freq 1
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
SOURCE_ORGANIZATION Unnamed: 4
count 1 0.0
unique 1 NaN
top International Monetary Fund, International Fin... NaN
freq 1 NaN
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
Warning: No key identifier found in this file.
================================================================================
Loaded file: ./data/Metadata_Indicator_API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv
------------------------------------------------------------
Head:
"INDICATOR_CODE" INDICATOR_NAME \
0 FS.AST.PRVT.GD.ZS Domestic credit to private sector (% of GDP)
SOURCE_NOTE \
0 Domestic credit to private sector refers to fi...
SOURCE_ORGANIZATION Unnamed: 4
0 International Monetary Fund, International Fin... NaN
Random Sample (1 rows):
"INDICATOR_CODE" INDICATOR_NAME \
0 FS.AST.PRVT.GD.ZS Domestic credit to private sector (% of GDP)
SOURCE_NOTE \
0 Domestic credit to private sector refers to fi...
SOURCE_ORGANIZATION Unnamed: 4
0 International Monetary Fund, International Fin... NaN
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 "INDICATOR_CODE" 1 non-null object
1 INDICATOR_NAME 1 non-null object
2 SOURCE_NOTE 1 non-null object
3 SOURCE_ORGANIZATION 1 non-null object
4 Unnamed: 4 0 non-null float64
dtypes: float64(1), object(4)
memory usage: 172.0+ bytes
Describe:
"INDICATOR_CODE" INDICATOR_NAME \
count 1 1
unique 1 1
top FS.AST.PRVT.GD.ZS Domestic credit to private sector (% of GDP)
freq 1 1
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
SOURCE_NOTE \
count 1
unique 1
top Domestic credit to private sector refers to fi...
freq 1
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
SOURCE_ORGANIZATION Unnamed: 4
count 1 0.0
unique 1 NaN
top International Monetary Fund, International Fin... NaN
freq 1 NaN
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
Warning: No key identifier found in this file.
================================================================================
def assess_data_quality(data_dict, group_name):
print(f"\n==== Assessing Data Quality for {group_name} ====\n")
for file, df in data_dict.items():
print(f"File: {file}")
print("-" * 60)
# 1. Missing Values
print("Missing Values per Column:")
missing = df.isnull().sum()
print(missing)
# 2. Data Types
print("\nData Types:")
print(df.dtypes)
# 3. Ranges for Numeric Columns
num_cols = df.select_dtypes(include=['number']).columns
if len(num_cols) > 0:
print("\nNumeric Column Ranges:")
for col in num_cols:
col_min = df[col].min()
col_max = df[col].max()
print(f" - {col}: min = {col_min}, max = {col_max}")
else:
print("\nNo numeric columns to assess ranges.")
print("=" * 80, "\n")
# Assess data quality for each group
# For Primary Financial Indicators files
assess_data_quality(primary_data, "Primary Financial Indicators")
# For Country-Level Meta Data files
assess_data_quality(country_meta_data, "Country-Level Meta Data")
# For Indicator-Level Meta Data files
assess_data_quality(indicator_meta_data, "Indicator-Level Meta Data")
==== Assessing Data Quality for Primary Financial Indicators ====
File: ./data/API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv
------------------------------------------------------------
Missing Values per Column:
Country Name 0
Country Code 0
Indicator Name 0
Indicator Code 0
1960 266
...
2020 33
2021 33
2022 35
2023 44
Unnamed: 68 266
Length: 69, dtype: int64
Data Types:
Country Name object
Country Code object
Indicator Name object
Indicator Code object
1960 float64
...
2020 float64
2021 float64
2022 float64
2023 float64
Unnamed: 68 float64
Length: 69, dtype: object
Numeric Column Ranges:
- 1960: min = nan, max = nan
- 1961: min = nan, max = nan
- 1962: min = nan, max = nan
- 1963: min = nan, max = nan
- 1964: min = nan, max = nan
- 1965: min = nan, max = nan
- 1966: min = nan, max = nan
- 1967: min = nan, max = nan
- 1968: min = nan, max = nan
- 1969: min = nan, max = nan
- 1970: min = -1.81884321571481, max = 47.76106286027
- 1971: min = -9.94310820508451, max = 15.1182129880903
- 1972: min = -2.68999382133914, max = 16.1443784008669
- 1973: min = -4.18742324543187, max = 19.3400854798701
- 1974: min = -8.21878721946481, max = 22.6818469323213
- 1975: min = -17.4447517681919, max = 31.2371601870178
- 1976: min = -2.82944990525817, max = 13.9388951150516
- 1977: min = -3.52299298519096, max = 25.3264955856564
- 1978: min = -7.83794017417645, max = 20.1731525441121
- 1979: min = -3.5015974313099, max = 15.5941398900164
- 1980: min = -11.9649880051652, max = 18.1369565217391
- 1981: min = -2.04574537666979, max = 34.0159560054789
- 1982: min = -6.01110277644262, max = 14.4545454545455
- 1983: min = -2.90604150668372, max = 15.5397339699864
- 1984: min = -8.20601851851852, max = 6.59196673143056
- 1985: min = -3.61271407759756, max = 7.20672627785934
- 1986: min = -28.6242557290297, max = 9.20162995625447
- 1987: min = -13.2244897857143, max = 13.5578926036223
- 1988: min = -14.7286821705426, max = 27.9447173167678
- 1989: min = -55.234058219683, max = 83.4338038789266
- 1990: min = -2.98142869155218, max = 22.454824411865
- 1991: min = -2.90401665495103, max = 105.273563218391
- 1992: min = -6.09661673223004, max = 39.2453267428425
- 1993: min = -6.89760875290975, max = 31.7593516209476
- 1994: min = -5.8744394618834, max = 69.0627836611195
- 1995: min = -7.19807121661721, max = 89.4756175644807
- 1996: min = -15.3381430363864, max = 161.823760047878
- 1997: min = -2.87977200424178, max = 167.329165258533
- 1998: min = -202.823876529477, max = 23.8082020678648
- 1999: min = -7.59550090453841, max = 70.304033499321
- 2000: min = -6.8287271729666, max = 799.33394273342
- 2001: min = -32.2326777041943, max = 376.801268564647
- 2002: min = -11.1735603550901, max = 255.307944331899
- 2003: min = -3.81179323818467, max = 563.548407744583
- 2004: min = -4.68763897846137, max = 122.36024411059
- 2005: min = -8.80493563521364, max = 341.007650578171
- 2006: min = -6.22149014709964, max = 1114.93238551093
- 2007: min = -17.5903165005139, max = 728.865659684804
- 2008: min = -6.54973190348526, max = 319.097324077722
- 2009: min = -3.48204515637954, max = 664.573767394696
- 2010: min = -15.8521934475209, max = 598.033933891729
- 2011: min = -5.17904034375746, max = 742.553401614666
- 2012: min = -22.4122807694915, max = 542.265724954567
- 2013: min = -5.65561784857252, max = 432.859399519303
- 2014: min = -8.2672072479104, max = 394.898959609935
- 2015: min = -7.41855700744106, max = 1509.00209846042
- 2016: min = -37.1726532341221, max = 972.693538598386
- 2017: min = -28.3072317012805, max = 602.374526148965
- 2018: min = -1303.10826691135, max = 452.221039539527
- 2019: min = -13.6740358407126, max = 252.920356064947
- 2020: min = -101.833102140082, max = 200.831508487808
- 2021: min = -14.4091790381318, max = 433.860897111274
- 2022: min = -360.352806534785, max = 126.275787596203
- 2023: min = -34.0103605718433, max = 286.040067701696
- Unnamed: 68: min = nan, max = nan
================================================================================
File: ./data/API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv
------------------------------------------------------------
Missing Values per Column:
Country Name 0
Country Code 0
Indicator Name 0
Indicator Code 0
1960 266
...
2020 126
2021 128
2022 141
2023 266
Unnamed: 68 266
Length: 69, dtype: int64
Data Types:
Country Name object
Country Code object
Indicator Name object
Indicator Code object
1960 float64
...
2020 float64
2021 float64
2022 float64
2023 float64
Unnamed: 68 float64
Length: 69, dtype: object
Numeric Column Ranges:
- 1960: min = nan, max = nan
- 1961: min = nan, max = nan
- 1962: min = nan, max = nan
- 1963: min = nan, max = nan
- 1964: min = nan, max = nan
- 1965: min = nan, max = nan
- 1966: min = nan, max = nan
- 1967: min = nan, max = nan
- 1968: min = nan, max = nan
- 1969: min = nan, max = nan
- 1970: min = nan, max = nan
- 1971: min = nan, max = nan
- 1972: min = nan, max = nan
- 1973: min = nan, max = nan
- 1974: min = nan, max = nan
- 1975: min = nan, max = nan
- 1976: min = nan, max = nan
- 1977: min = nan, max = nan
- 1978: min = nan, max = nan
- 1979: min = nan, max = nan
- 1980: min = nan, max = nan
- 1981: min = nan, max = nan
- 1982: min = nan, max = nan
- 1983: min = nan, max = nan
- 1984: min = nan, max = nan
- 1985: min = nan, max = nan
- 1986: min = nan, max = nan
- 1987: min = nan, max = nan
- 1988: min = nan, max = nan
- 1989: min = nan, max = nan
- 1990: min = nan, max = nan
- 1991: min = nan, max = nan
- 1992: min = nan, max = nan
- 1993: min = nan, max = nan
- 1994: min = nan, max = nan
- 1995: min = nan, max = nan
- 1996: min = nan, max = nan
- 1997: min = nan, max = nan
- 1998: min = nan, max = nan
- 1999: min = nan, max = nan
- 2000: min = 6.83028179331357, max = 21.215964833505
- 2001: min = 1.61915877786992, max = 17.9071007297324
- 2002: min = 1.8229968378738, max = 11.0172196258194
- 2003: min = 1.63483787843428, max = 9.49739951360216
- 2004: min = 1.20249750508307, max = 7.18269460394702
- 2005: min = 0.480366744116291, max = 57.9897716519799
- 2006: min = 0.420325955427002, max = 59.7566320355279
- 2007: min = 0.226516077373678, max = 48.1183737206673
- 2008: min = 0.409453887918507, max = 13.675346678958
- 2009: min = 0.56999382899243, max = 37.2532872694237
- 2010: min = 0.247232603084002, max = 49.9013229069412
- 2011: min = 0.344719585672998, max = 34.5595714508246
- 2012: min = 0.145588493246799, max = 21.3116231755291
- 2013: min = 0.0923353115576881, max = 42.3009858312383
- 2014: min = 0.118444607476519, max = 44.972072758843
- 2015: min = 0.119843705534916, max = 49.1057251543943
- 2016: min = 0.195618683773704, max = 47.5958495651832
- 2017: min = 0.227486975366639, max = 54.8232758416853
- 2018: min = 0.217152000015563, max = 54.4100172630314
- 2019: min = 0.239742734729696, max = 58.9301560480268
- 2020: min = 0.12803046468493, max = 61.1160411121777
- 2021: min = 0.233334670386676, max = 58.9618327392133
- 2022: min = 0.24216660171932, max = 55.4109645865667
- 2023: min = nan, max = nan
- Unnamed: 68: min = nan, max = nan
================================================================================
File: ./data/API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv
------------------------------------------------------------
Missing Values per Column:
Country Name 0
Country Code 0
Indicator Name 0
Indicator Code 0
1960 179
...
2020 51
2021 55
2022 66
2023 74
Unnamed: 68 266
Length: 69, dtype: int64
Data Types:
Country Name object
Country Code object
Indicator Name object
Indicator Code object
1960 float64
...
2020 float64
2021 float64
2022 float64
2023 float64
Unnamed: 68 float64
Length: 69, dtype: object
Numeric Column Ranges:
- 1960: min = 0.544827291383113, max = 87.7884734668407
- 1961: min = 0.466948987073388, max = 88.6652176956245
- 1962: min = 0.453731369700999, max = 89.2011338254893
- 1963: min = 1.16767764298094, max = 89.098030054666
- 1964: min = 0.93846158177515, max = 87.0741961847816
- 1965: min = 1.10215068577294, max = 88.1763437244437
- 1966: min = 1.61962652714933, max = 87.6291865842292
- 1967: min = 0.715635832425586, max = 87.5536183761206
- 1968: min = 1.03124359221861, max = 89.5841007877957
- 1969: min = 1.05566686542523, max = 91.07487346495
- 1970: min = 1.68390804357819, max = 98.6003736940941
- 1971: min = 1.9256648624975, max = 108.902797869506
- 1972: min = 1.99427319803606, max = 117.807142470452
- 1973: min = 1.7017680542504, max = 116.689574927676
- 1974: min = 0.629137194369974, max = 111.534283541547
- 1975: min = 0.286230893153044, max = 114.410289976096
- 1976: min = 0.300012336518998, max = 114.896624089411
- 1977: min = 0.349506789561202, max = 112.889931497273
- 1978: min = 0.396823777975898, max = 112.385816838088
- 1979: min = 0.386916530190786, max = 113.6625656854
- 1980: min = 0.223452174364765, max = 114.940140794194
- 1981: min = 0.232697891612085, max = 117.438575115279
- 1982: min = 0.186822225170531, max = 122.157352579406
- 1983: min = 0.186403126790989, max = 126.330115447014
- 1984: min = 0.189839398032526, max = 129.001550379837
- 1985: min = 0.196159633532841, max = 132.411346455136
- 1986: min = 0.190245043198352, max = 137.853049784187
- 1987: min = 0.198972477553255, max = 151.49984059924
- 1988: min = 0.154218406276182, max = 158.49262632534
- 1989: min = 0.335094877540951, max = 164.750750651349
- 1990: min = 0.681521397391369, max = 167.665774034886
- 1991: min = 0.597685342962604, max = 166.299871657358
- 1992: min = 0.767264257415784, max = 169.429936298152
- 1993: min = 0.913687944593067, max = 171.739896586269
- 1994: min = 1.00783696764076, max = 173.104938364636
- 1995: min = 0.885382690861098, max = 4490.86769865028
- 1996: min = 0.74167908145002, max = 2887.01575367049
- 1997: min = 0.55281250319524, max = 2842.96977830542
- 1998: min = 1.96258997070586, max = 4507.83185075911
- 1999: min = 1.61553114401575, max = 4537.21457967783
- 2000: min = 0.732211193094999, max = 4370.41544180119
- 2001: min = 0.000987334632046, max = 148.976631278091
- 2002: min = 0.0012455196791345, max = 149.493012840461
- 2003: min = 0.0018169476520058, max = 147.195228621501
- 2004: min = 0.0022066059761882, max = 157.797767746874
- 2005: min = 0.0020945776205651, max = 239.561823087085
- 2006: min = 0.002160517180355, max = 304.575112745362
- 2007: min = 0.0027800888279359, max = 246.609307337934
- 2008: min = 0.0036176565615617, max = 212.093506733987
- 2009: min = 0.0052906257539571, max = 226.238177594747
- 2010: min = 0.0049676359057579, max = 234.128270166471
- 2011: min = 0.0048894627951967, max = 242.425310610419
- 2012: min = 0.0035098001776334, max = 249.095015986358
- 2013: min = 0.0030138963255731, max = 254.551658820734
- 2014: min = 0.0030961252544575, max = 251.710266759916
- 2015: min = 0.0033345375033028, max = 243.304880425395
- 2016: min = 0.0034763588693557, max = 216.137570701221
- 2017: min = 0.0032071511828279, max = 223.371866412043
- 2018: min = 0.0035747014865994, max = 219.908123250485
- 2019: min = 0.0038077374525084, max = 236.740071570624
- 2020: min = 0.0035727260266691, max = 269.161635493478
- 2021: min = 0.0042047248946708, max = 259.183988203515
- 2022: min = 0.0035239298732376, max = 286.55742251628
- 2023: min = 0.0032244627076846, max = 248.758612570897
- Unnamed: 68: min = nan, max = nan
================================================================================
File: ./data/API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv
------------------------------------------------------------
Missing Values per Column:
Country Name 0
Country Code 0
Indicator Name 0
Indicator Code 0
1960 184
...
2020 81
2021 90
2022 103
2023 122
Unnamed: 68 266
Length: 69, dtype: int64
Data Types:
Country Name object
Country Code object
Indicator Name object
Indicator Code object
1960 float64
...
2020 float64
2021 float64
2022 float64
2023 float64
Unnamed: 68 float64
Length: 69, dtype: object
Numeric Column Ranges:
- 1960: min = 5.41182559803319, max = 90.3229450564068
- 1961: min = 6.27387869811424, max = 88.569555469528
- 1962: min = 3.80710177055953, max = 90.8753385318001
- 1963: min = 4.09323008716412, max = 89.9449360237523
- 1964: min = 4.01944288835362, max = 90.7860573214119
- 1965: min = 3.65694920415135, max = 91.6119272756964
- 1966: min = 4.24831480233389, max = 91.6305598686251
- 1967: min = 4.7472868485683, max = 91.8319740088531
- 1968: min = 5.02135528984759, max = 96.2991268661135
- 1969: min = 5.39463797752886, max = 101.526489360264
- 1970: min = 5.34182464791585, max = 100.503541339364
- 1971: min = 6.35406856536997, max = 112.096523735482
- 1972: min = 6.85761928749791, max = 122.356887953573
- 1973: min = 7.34999207678534, max = 119.552539692199
- 1974: min = 0.649569629708066, max = 114.273734341888
- 1975: min = 0.275608482875548, max = 120.969393404502
- 1976: min = 0.392924854306579, max = 124.420808001124
- 1977: min = 0.391213776414474, max = 126.977614302719
- 1978: min = 0.453267128932197, max = 131.928263584428
- 1979: min = 0.409300240416835, max = 135.294717013188
- 1980: min = 0.299054853843612, max = 136.841459895289
- 1981: min = 0.267337001215498, max = 142.627211315878
- 1982: min = 0.326999337779305, max = 148.322840620485
- 1983: min = 0.379739422607469, max = 153.85823343905
- 1984: min = 0.371212275138373, max = 156.710679601932
- 1985: min = 0.435561503296713, max = 160.035405894265
- 1986: min = 0.523166827140101, max = 166.475451338615
- 1987: min = 0.520063989468076, max = 174.362999734779
- 1988: min = 0.534718518490711, max = 177.23830900656
- 1989: min = 0.854895111669917, max = 183.686675411154
- 1990: min = 7.23783885962536, max = 193.731581562473
- 1991: min = 2.1938560603028, max = 192.239126852806
- 1992: min = 4.08615206518804, max = 182.077478393815
- 1993: min = 4.89445955475539, max = 189.997260642167
- 1994: min = 6.32607530563365, max = 195.465948323022
- 1995: min = 6.32263978698623, max = 13806.7699057663
- 1996: min = 6.03443485654017, max = 12594.9409367459
- 1997: min = 6.19527451266445, max = 19402.82915571
- 1998: min = 5.14320048581012, max = 16777.0737883825
- 1999: min = 5.92621877124759, max = 16339.1004944112
- 2000: min = 3.42308139669861, max = 18470.3381459423
- 2001: min = 2.85740812895541, max = 226.534329092191
- 2002: min = 2.91725803901463, max = 231.958311065905
- 2003: min = 3.16074494763066, max = 254.495800069468
- 2004: min = 5.06281053286604, max = 262.821598558486
- 2005: min = 4.53035061027253, max = 257.688533947066
- 2006: min = 5.92029022328923, max = 276.872062745161
- 2007: min = 6.99603968549037, max = 290.577753041637
- 2008: min = 6.4761202007167, max = 289.017219457601
- 2009: min = 9.71150702222512, max = 315.700700680129
- 2010: min = 10.1163035677052, max = 315.264425681686
- 2011: min = 9.02170278597333, max = 323.843597959089
- 2012: min = 10.7499611405792, max = 336.252603827381
- 2013: min = 11.1999589942499, max = 353.915337054349
- 2014: min = 11.8740496220761, max = 363.365683969726
- 2015: min = 0.0153190301749605, max = 365.587623171606
- 2016: min = 0.0160391997358557, max = 375.038760169244
- 2017: min = 0.0151801010542364, max = 396.15178949102
- 2018: min = 0.0147272954355909, max = 386.098765724693
- 2019: min = 0.0145171549458104, max = 403.314068678555
- 2020: min = 0.0179333991187372, max = 454.703219307323
- 2021: min = 0.0192391992698411, max = 437.798213895319
- 2022: min = 0.0203970347103604, max = 461.607360569545
- 2023: min = 0.0198352929421525, max = 451.722286684234
- Unnamed: 68: min = nan, max = nan
================================================================================
File: ./data/API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv
------------------------------------------------------------
Missing Values per Column:
Country Name 0
Country Code 0
Indicator Name 0
Indicator Code 0
1960 262
...
2020 215
2021 216
2022 216
2023 220
Unnamed: 68 266
Length: 69, dtype: int64
Data Types:
Country Name object
Country Code object
Indicator Name object
Indicator Code object
1960 float64
...
2020 float64
2021 float64
2022 float64
2023 float64
Unnamed: 68 float64
Length: 69, dtype: object
Numeric Column Ranges:
- 1960: min = 17.6605245307143, max = 101.328694267945
- 1961: min = 17.8868432613576, max = 106.71797125053
- 1962: min = 17.2659768639598, max = 106.515365444874
- 1963: min = 17.926408169245, max = 110.602848972855
- 1964: min = 17.442243973229, max = 112.648360941239
- 1965: min = 4.84592908654856, max = 114.296734336246
- 1966: min = 5.91051774524756, max = 109.754159337096
- 1967: min = 6.20937360419641, max = 114.249428305335
- 1968: min = 6.85326352786304, max = 114.654554469888
- 1969: min = 7.58111077438312, max = 110.969223221077
- 1970: min = 8.22327921813116, max = 127.30990330017
- 1971: min = 8.07625191598414, max = 139.959108855898
- 1972: min = 8.87621494453388, max = 152.007382334756
- 1973: min = 9.14302419152581, max = 149.707879101861
- 1974: min = 8.97546955441644, max = 146.042560493966
- 1975: min = -0.431576703621441, max = 155.905961757231
- 1976: min = -2.07556768266504, max = 161.171826855248
- 1977: min = -9.31364181681421, max = 163.53252625863
- 1978: min = -5.44221494686961, max = 169.07818412737
- 1979: min = -1.68501537976286, max = 173.58512350942
- 1980: min = -4.56658522688823, max = 178.69584759831
- 1981: min = -1.50346038986363, max = 184.51480777333
- 1982: min = 9.21344226507459, max = 193.721003558398
- 1983: min = 9.65488907148726, max = 226.90538147139
- 1984: min = 11.8990384615385, max = 253.267633597857
- 1985: min = 8.25743089578077, max = 289.509282051282
- 1986: min = 8.00210990102082, max = 304.060705256254
- 1987: min = 8.5332912715156, max = 251.314899242737
- 1988: min = 9.78508598444181, max = 333.987248283887
- 1989: min = 11.2179045835514, max = 299.061382326895
- 1990: min = 11.9705907497649, max = 265.864312768502
- 1991: min = 11.5829668943736, max = 246.012922986886
- 1992: min = 12.6916002032577, max = 265.764524757136
- 1993: min = 10.8446794241956, max = 262.71214661392
- 1994: min = 10.5220178739894, max = 268.663093607737
- 1995: min = 9.82120710344781, max = 272.574425316829
- 1996: min = 7.20536795597285, max = 275.743373177452
- 1997: min = 6.8467914845147, max = 262.276656191001
- 1998: min = 6.14669975411094, max = 280.827176422477
- 1999: min = 7.34131753454976, max = 292.052964957859
- 2000: min = 8.23428770016307, max = 290.196962110143
- 2001: min = 5.72721185526847, max = 280.530390006151
- 2002: min = 5.89281240891063, max = 288.405464689291
- 2003: min = 6.27864254812847, max = 297.168610975335
- 2004: min = 5.02912519620321, max = 293.608773897134
- 2005: min = 6.6261292424553, max = 294.479711016917
- 2006: min = 6.76624715430518, max = 288.82574179205
- 2007: min = 7.08662692007408, max = 280.759832940743
- 2008: min = 8.37325155631622, max = 287.303812370719
- 2009: min = 15.2772680678067, max = 313.902033061084
- 2010: min = 7.63714397380134, max = 311.661509708809
- 2011: min = 7.16152427927212, max = 320.432229320439
- 2012: min = 10.1795426462862, max = 327.868401739389
- 2013: min = 15.2291570008484, max = 336.965554984602
- 2014: min = 17.8995805418587, max = 344.926765238208
- 2015: min = 19.7296251394019, max = 341.142455573764
- 2016: min = 20.7184218442362, max = 346.832508947128
- 2017: min = 13.9316405221928, max = 352.149390767584
- 2018: min = 13.1328229332882, max = 350.615462584578
- 2019: min = 6.61287108099699, max = 363.317971259922
- 2020: min = 10.5005937233074, max = 393.4512665795
- 2021: min = 10.3735942932122, max = 391.25195404612
- 2022: min = 8.84730639883332, max = 385.902094673653
- 2023: min = 11.3075260651342, max = 380.22941146695
- Unnamed: 68: min = nan, max = nan
================================================================================
File: ./data/API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv
------------------------------------------------------------
Missing Values per Column:
Country Name 0
Country Code 0
Indicator Name 0
Indicator Code 0
1960 203
...
2020 57
2021 61
2022 72
2023 93
Unnamed: 68 266
Length: 69, dtype: int64
Data Types:
Country Name object
Country Code object
Indicator Name object
Indicator Code object
1960 float64
...
2020 float64
2021 float64
2022 float64
2023 float64
Unnamed: 68 float64
Length: 69, dtype: object
Numeric Column Ranges:
- 1960: min = 0.544827291383113, max = 87.7884734668407
- 1961: min = 0.466948987073388, max = 88.6652176956245
- 1962: min = 0.453731369700999, max = 89.2011338254893
- 1963: min = 1.23483535528596, max = 89.098030054666
- 1964: min = 0.93846158177515, max = 87.0741961847816
- 1965: min = 1.10215068577294, max = 88.1763437244437
- 1966: min = 1.69811311420267, max = 87.6291865842292
- 1967: min = 1.46778624705344, max = 87.5536183761206
- 1968: min = 1.05166425741106, max = 89.5841007877957
- 1969: min = 1.05566686542523, max = 91.07487346495
- 1970: min = 1.68390804357819, max = 105.969890380202
- 1971: min = 1.9256648624975, max = 117.327586371128
- 1972: min = 2.03656394842516, max = 126.57063036777
- 1973: min = 1.7017680542504, max = 125.366842623993
- 1974: min = 0.629215114674229, max = 119.954549046701
- 1975: min = 0.293244144726216, max = 123.24323278834
- 1976: min = 0.305360806177619, max = 123.877453731257
- 1977: min = 0.354171828827184, max = 121.843474745811
- 1978: min = 0.397637402875322, max = 121.110402009136
- 1979: min = 0.0077257969420291, max = 122.549028727429
- 1980: min = 0.224611106401564, max = 124.506445159423
- 1981: min = 0.234053744253535, max = 127.487155696374
- 1982: min = 0.188218039161905, max = 132.560903355012
- 1983: min = 0.188325374710165, max = 136.967063460906
- 1984: min = 0.194443195682705, max = 139.731267494954
- 1985: min = 0.19803401183335, max = 143.293694125624
- 1986: min = 0.192321691381103, max = 148.983097386027
- 1987: min = 0.201292314197885, max = 164.598028788581
- 1988: min = 0.156348345803423, max = 173.485377848245
- 1989: min = 0.340667608015901, max = 181.906930654757
- 1990: min = 0.962944843883038, max = 186.987678819914
- 1991: min = 0.666042626156598, max = 186.656842054711
- 1992: min = 0.928565191975924, max = 190.373634464532
- 1993: min = 1.21082223551764, max = 193.807386276015
- 1994: min = 1.05685691922687, max = 195.586907517692
- 1995: min = 1.06312014122331, max = 4545.06890490193
- 1996: min = 1.01593947680101, max = 2955.54728166002
- 1997: min = 0.68777945613985, max = 2854.83258441036
- 1998: min = 1.96258997070586, max = 4516.72729611422
- 1999: min = 1.61553114401575, max = 4544.08342043556
- 2000: min = 0.776252724448565, max = 4376.40303334019
- 2001: min = 0.0010316241276931, max = 180.201537127819
- 2002: min = 0.0012966126450079, max = 177.918035565601
- 2003: min = 0.001851528814682, max = 177.536325650048
- 2004: min = 0.0022392943555062, max = 184.805504038329
- 2005: min = 0.0021695303066487, max = 239.561823087085
- 2006: min = 0.0021897618535402, max = 304.575112745362
- 2007: min = 0.0028075827333138, max = 246.609307337934
- 2008: min = 0.0036415345517308, max = 212.209238098344
- 2009: min = 0.0053332999357277, max = 226.355979301383
- 2010: min = 0.0050010907515955, max = 234.2515929319
- 2011: min = 0.0049180494518027, max = 242.541132511585
- 2012: min = 0.0035373264833552, max = 249.218122126626
- 2013: min = 0.003048762229442, max = 254.66806457038
- 2014: min = 0.003125365136198, max = 251.830384825679
- 2015: min = 0.0034161743588991, max = 243.416337343981
- 2016: min = 0.0035423485873362, max = 216.242757371797
- 2017: min = 0.0032884239039179, max = 223.371866412043
- 2018: min = 0.0036406032253703, max = 219.908123250485
- 2019: min = 0.0038596499470339, max = 236.740071570624
- 2020: min = 0.0036197360515618, max = 269.161635493478
- 2021: min = 0.004227162012086, max = 259.183988203515
- 2022: min = 0.0035388613148591, max = 286.55742251628
- 2023: min = 0.0032384320094154, max = 248.758612570897
- Unnamed: 68: min = nan, max = nan
================================================================================
==== Assessing Data Quality for Country-Level Meta Data ====
File: ./data/Metadata_Country_API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv
------------------------------------------------------------
Missing Values per Column:
"Country Code" 0
Region 48
IncomeGroup 49
SpecialNotes 138
TableName 0
Unnamed: 5 265
dtype: int64
Data Types:
"Country Code" object
Region object
IncomeGroup object
SpecialNotes object
TableName object
Unnamed: 5 float64
dtype: object
Numeric Column Ranges:
- Unnamed: 5: min = nan, max = nan
================================================================================
File: ./data/Metadata_Country_API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv
------------------------------------------------------------
Missing Values per Column:
"Country Code" 0
Region 48
IncomeGroup 49
SpecialNotes 138
TableName 0
Unnamed: 5 265
dtype: int64
Data Types:
"Country Code" object
Region object
IncomeGroup object
SpecialNotes object
TableName object
Unnamed: 5 float64
dtype: object
Numeric Column Ranges:
- Unnamed: 5: min = nan, max = nan
================================================================================
File: ./data/Metadata_Country_API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv
------------------------------------------------------------
Missing Values per Column:
"Country Code" 0
Region 48
IncomeGroup 49
SpecialNotes 138
TableName 0
Unnamed: 5 265
dtype: int64
Data Types:
"Country Code" object
Region object
IncomeGroup object
SpecialNotes object
TableName object
Unnamed: 5 float64
dtype: object
Numeric Column Ranges:
- Unnamed: 5: min = nan, max = nan
================================================================================
File: ./data/Metadata_Country_API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv
------------------------------------------------------------
Missing Values per Column:
"Country Code" 0
Region 48
IncomeGroup 49
SpecialNotes 138
TableName 0
Unnamed: 5 265
dtype: int64
Data Types:
"Country Code" object
Region object
IncomeGroup object
SpecialNotes object
TableName object
Unnamed: 5 float64
dtype: object
Numeric Column Ranges:
- Unnamed: 5: min = nan, max = nan
================================================================================
File: ./data/Metadata_Country_API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv
------------------------------------------------------------
Missing Values per Column:
"Country Code" 0
Region 48
IncomeGroup 49
SpecialNotes 138
TableName 0
Unnamed: 5 265
dtype: int64
Data Types:
"Country Code" object
Region object
IncomeGroup object
SpecialNotes object
TableName object
Unnamed: 5 float64
dtype: object
Numeric Column Ranges:
- Unnamed: 5: min = nan, max = nan
================================================================================
File: ./data/Metadata_Country_API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv
------------------------------------------------------------
Missing Values per Column:
"Country Code" 0
Region 48
IncomeGroup 49
SpecialNotes 138
TableName 0
Unnamed: 5 265
dtype: int64
Data Types:
"Country Code" object
Region object
IncomeGroup object
SpecialNotes object
TableName object
Unnamed: 5 float64
dtype: object
Numeric Column Ranges:
- Unnamed: 5: min = nan, max = nan
================================================================================
==== Assessing Data Quality for Indicator-Level Meta Data ====
File: ./data/Metadata_Indicator_API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv
------------------------------------------------------------
Missing Values per Column:
"INDICATOR_CODE" 0
INDICATOR_NAME 0
SOURCE_NOTE 0
SOURCE_ORGANIZATION 0
Unnamed: 4 1
dtype: int64
Data Types:
"INDICATOR_CODE" object
INDICATOR_NAME object
SOURCE_NOTE object
SOURCE_ORGANIZATION object
Unnamed: 4 float64
dtype: object
Numeric Column Ranges:
- Unnamed: 4: min = nan, max = nan
================================================================================
File: ./data/Metadata_Indicator_API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv
------------------------------------------------------------
Missing Values per Column:
"INDICATOR_CODE" 0
INDICATOR_NAME 0
SOURCE_NOTE 0
SOURCE_ORGANIZATION 0
Unnamed: 4 1
dtype: int64
Data Types:
"INDICATOR_CODE" object
INDICATOR_NAME object
SOURCE_NOTE object
SOURCE_ORGANIZATION object
Unnamed: 4 float64
dtype: object
Numeric Column Ranges:
- Unnamed: 4: min = nan, max = nan
================================================================================
File: ./data/Metadata_Indicator_API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv
------------------------------------------------------------
Missing Values per Column:
"INDICATOR_CODE" 0
INDICATOR_NAME 0
SOURCE_NOTE 0
SOURCE_ORGANIZATION 0
Unnamed: 4 1
dtype: int64
Data Types:
"INDICATOR_CODE" object
INDICATOR_NAME object
SOURCE_NOTE object
SOURCE_ORGANIZATION object
Unnamed: 4 float64
dtype: object
Numeric Column Ranges:
- Unnamed: 4: min = nan, max = nan
================================================================================
File: ./data/Metadata_Indicator_API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv
------------------------------------------------------------
Missing Values per Column:
"INDICATOR_CODE" 0
INDICATOR_NAME 0
SOURCE_NOTE 0
SOURCE_ORGANIZATION 0
Unnamed: 4 1
dtype: int64
Data Types:
"INDICATOR_CODE" object
INDICATOR_NAME object
SOURCE_NOTE object
SOURCE_ORGANIZATION object
Unnamed: 4 float64
dtype: object
Numeric Column Ranges:
- Unnamed: 4: min = nan, max = nan
================================================================================
File: ./data/Metadata_Indicator_API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv
------------------------------------------------------------
Missing Values per Column:
"INDICATOR_CODE" 0
INDICATOR_NAME 0
SOURCE_NOTE 0
SOURCE_ORGANIZATION 0
Unnamed: 4 1
dtype: int64
Data Types:
"INDICATOR_CODE" object
INDICATOR_NAME object
SOURCE_NOTE object
SOURCE_ORGANIZATION object
Unnamed: 4 float64
dtype: object
Numeric Column Ranges:
- Unnamed: 4: min = nan, max = nan
================================================================================
File: ./data/Metadata_Indicator_API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv
------------------------------------------------------------
Missing Values per Column:
"INDICATOR_CODE" 0
INDICATOR_NAME 0
SOURCE_NOTE 0
SOURCE_ORGANIZATION 0
Unnamed: 4 1
dtype: int64
Data Types:
"INDICATOR_CODE" object
INDICATOR_NAME object
SOURCE_NOTE object
SOURCE_ORGANIZATION object
Unnamed: 4 float64
dtype: object
Numeric Column Ranges:
- Unnamed: 4: min = nan, max = nan
================================================================================
# For each primary financial indicator file
for file, df in primary_data.items():
print(f"\n==== Univariate Analysis for file: {file} ====")
# Identify numeric columns (indicator values)
numeric_cols = df.select_dtypes(include='number').columns.tolist()
# Remove potential extraneous columns (e.g., the empty 'Unnamed: 68')
if "Unnamed: 68" in numeric_cols:
numeric_cols.remove("Unnamed: 68")
print("Numeric columns available:", numeric_cols)
# For this example, choose the numeric column with the most non-null values
non_null_counts = {col: df[col].count() for col in numeric_cols}
chosen_col = max(non_null_counts, key=non_null_counts.get)
print(f"Chosen column for analysis: {chosen_col}")
# Generate summary statistics using .describe()
stats = df[chosen_col].describe()
mean_val = df[chosen_col].mean()
median_val = df[chosen_col].median()
var_val = df[chosen_col].var()
print("\nSummary Statistics:")
print(stats)
print(f"\nMean: {mean_val}")
print(f"Median: {median_val}")
print(f"Variance: {var_val}")
# Plot a histogram (with KDE) for the chosen column
plt.figure(figsize=(10, 6))
sns.histplot(df[chosen_col].dropna(), kde=True, bins=30)
plt.title(f"Histogram for {chosen_col}\n({file.split('/')[-1]})")
plt.xlabel(chosen_col)
plt.ylabel("Frequency")
plt.show()
# Plot a boxplot for the chosen column
plt.figure(figsize=(6, 4))
sns.boxplot(x=df[chosen_col].dropna())
plt.title(f"Boxplot for {chosen_col}\n({file.split('/')[-1]})")
plt.xlabel(chosen_col)
plt.show()
print("="*80)
==== Univariate Analysis for file: ./data/API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv ==== Numeric columns available: ['1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023'] Chosen column for analysis: 2014 Summary Statistics: count 240.000000 mean 7.445819 std 30.893137 min -8.267207 25% 1.369289 50% 2.569562 75% 4.418992 max 394.898960 Name: 2014, dtype: float64 Mean: 7.445819478385338 Median: 2.5695618208868654 Variance: 954.3858854612528
================================================================================ ==== Univariate Analysis for file: ./data/API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv ==== Numeric columns available: ['1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023'] Chosen column for analysis: 2018 Summary Statistics: count 147.000000 mean 7.396732 std 9.006356 min 0.217152 25% 2.237032 50% 3.686425 75% 9.603857 max 54.410017 Name: 2018, dtype: float64 Mean: 7.396732416345353 Median: 3.68642493378735 Variance: 81.11445106829257
================================================================================ ==== Univariate Analysis for file: ./data/API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv ==== Numeric columns available: ['1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023'] Chosen column for analysis: 2013 Summary Statistics: count 229.000000 mean 50.876700 std 39.468547 min 0.003014 25% 21.328569 50% 41.079745 75% 66.987640 max 254.551659 Name: 2013, dtype: float64 Mean: 50.87670017805866 Median: 41.0797449988366 Variance: 1557.7661831721584
================================================================================ ==== Univariate Analysis for file: ./data/API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv ==== Numeric columns available: ['1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023'] Chosen column for analysis: 2005 Summary Statistics: count 205.000000 mean 52.636183 std 39.066277 min 4.530351 25% 24.283395 50% 44.516882 75% 69.111826 max 257.688534 Name: 2005, dtype: float64 Mean: 52.636183193941136 Median: 44.5168816295289 Variance: 1526.1739675784306
================================================================================ ==== Univariate Analysis for file: ./data/API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv ==== Numeric columns available: ['1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023'] Chosen column for analysis: 2017 Summary Statistics: count 53.000000 mean 80.434390 std 62.782078 min 13.931641 25% 41.731155 50% 63.610667 75% 92.534391 max 352.149391 Name: 2017, dtype: float64 Mean: 80.43439027686142 Median: 63.6106665234143 Variance: 3941.5893329497344
================================================================================ ==== Univariate Analysis for file: ./data/API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv ==== Numeric columns available: ['1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023'] Chosen column for analysis: 2013 Summary Statistics: count 223.000000 mean 55.567804 std 44.064635 min 0.003049 25% 22.123508 50% 42.770132 75% 74.572703 max 254.668065 Name: 2013, dtype: float64 Mean: 55.56780395104206 Median: 42.7701323168693 Variance: 1941.692042670021
================================================================================
# Define a function to perform univariate analysis on a chosen numeric column
def univariate_analysis(df, col, title=""):
# Print summary statistics
print(f"Summary Statistics for {title} - Column: {col}")
stats = df[col].describe()
mean_val = df[col].mean()
median_val = df[col].median()
var_val = df[col].var()
print(stats)
print(f"Mean: {mean_val}")
print(f"Median: {median_val}")
print(f"Variance: {var_val}\n")
# Create a histogram with KDE
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df[col].dropna(), kde=True, color='skyblue')
plt.title(f"Histogram of {title} - {col}")
plt.xlabel(col)
# Create a boxplot
plt.subplot(1, 2, 2)
sns.boxplot(x=df[col].dropna(), color='lightgreen')
plt.title(f"Boxplot of {title} - {col}")
plt.xlabel(col)
plt.tight_layout()
plt.show()
# Create a dictionary mapping each primary indicator file to the chosen year column
chosen_years = {
"./data/API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv": "2014",
"./data/API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv": "2018",
"./data/API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv": "2013",
"./data/API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv": "2005",
"./data/API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv": "2017",
"./data/API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv": "2013"
}
# Loop over the primary data files and perform univariate analysis
for file, df in primary_data.items():
if file in chosen_years:
col = chosen_years[file]
print("=" * 80)
print(f"==== Univariate Analysis for file: {file} ====")
univariate_analysis(df, col, title=file)
================================================================================ ==== Univariate Analysis for file: ./data/API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv ==== Summary Statistics for ./data/API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv - Column: 2014 count 240.000000 mean 7.445819 std 30.893137 min -8.267207 25% 1.369289 50% 2.569562 75% 4.418992 max 394.898960 Name: 2014, dtype: float64 Mean: 7.445819478385338 Median: 2.5695618208868654 Variance: 954.3858854612528
================================================================================ ==== Univariate Analysis for file: ./data/API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv ==== Summary Statistics for ./data/API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv - Column: 2018 count 147.000000 mean 7.396732 std 9.006356 min 0.217152 25% 2.237032 50% 3.686425 75% 9.603857 max 54.410017 Name: 2018, dtype: float64 Mean: 7.396732416345353 Median: 3.68642493378735 Variance: 81.11445106829257
================================================================================ ==== Univariate Analysis for file: ./data/API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv ==== Summary Statistics for ./data/API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv - Column: 2013 count 229.000000 mean 50.876700 std 39.468547 min 0.003014 25% 21.328569 50% 41.079745 75% 66.987640 max 254.551659 Name: 2013, dtype: float64 Mean: 50.87670017805866 Median: 41.0797449988366 Variance: 1557.7661831721584
================================================================================ ==== Univariate Analysis for file: ./data/API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv ==== Summary Statistics for ./data/API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv - Column: 2005 count 205.000000 mean 52.636183 std 39.066277 min 4.530351 25% 24.283395 50% 44.516882 75% 69.111826 max 257.688534 Name: 2005, dtype: float64 Mean: 52.636183193941136 Median: 44.5168816295289 Variance: 1526.1739675784306
================================================================================ ==== Univariate Analysis for file: ./data/API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv ==== Summary Statistics for ./data/API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv - Column: 2017 count 53.000000 mean 80.434390 std 62.782078 min 13.931641 25% 41.731155 50% 63.610667 75% 92.534391 max 352.149391 Name: 2017, dtype: float64 Mean: 80.43439027686142 Median: 63.6106665234143 Variance: 3941.5893329497344
================================================================================ ==== Univariate Analysis for file: ./data/API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv ==== Summary Statistics for ./data/API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv - Column: 2013 count 223.000000 mean 55.567804 std 44.064635 min 0.003049 25% 22.123508 50% 42.770132 75% 74.572703 max 254.668065 Name: 2013, dtype: float64 Mean: 55.56780395104206 Median: 42.7701323168693 Variance: 1941.692042670021
Univariate Analysis for file: ./data/API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv ====
Summary Statistics for ./data/API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv - Column: 2014
Name: 2014, dtype: float64 Mean: 7.445819478385338
Median: 2.5695618208868654 Variance: 954.3858854612528
===============================================================================Univariate Analysis for file: ./data/API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv ====
Summary Statistics for ./data/API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv - Column: 2018
Name: 2018, dtype: float64 Mean: 7.396732416345353 Median: 3.68642493378735 Variance: 81.11445106829257
=============================================================================== ==== Univariate Analysis for file: ./data/API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv ====
Summary Statistics for ./data/API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv - Column: 2013
Name: 2013, dtype: float64 Mean: 50.87670017805866 Median: 41.0797449988366 Variance: 1557.7661831721584
===============================================================================Univariate Analysis for file: ./data/API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv ====
Summary Statistics for ./data/API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv - Column: 2005
Name: 2005, dtype: float64 Mean: 52.636183193941136 Median: 44.5168816295289 Variance: 1526.1739675784306
===============================================================================Univariate Analysis for file: ./data/API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv ====
Summary Statistics for ./data/API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv - Column: 2017
Name: 2017, dtype: float64 Mean: 80.43439027686142 Median: 63.6106665234143 Variance: 3941.5893329497344
===============================================================================Univariate Analysis for file: ./data/API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv ====
Summary Statistics for ./data/API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv - Column: 2013
Name: 2013, dtype: float64 Mean: 55.56780395104206 Median: 42.7701323168693 Variance: 1941.692042670021
Skewness: Almost all distributions are right-skewed, meaning there are many countries with relatively modest values, but a small number of extremely high values. The mean is consistently greater than the median in these samples, which is a classic sign of right skew.
Outliers: All indicators exhibit notable outliers. For example, FDI net inflows can spike above 100% of GDP, or even approach 400% for certain edge cases. Broad money, domestic credit, and private sector credit also have extreme upper values well above the typical range.
Missing Values: In each file, certain years or columns have a large fraction of missing data. This is expected in cross-country time-series data, as not all countries report consistently.
Variability: Standard deviations and variances are quite high relative to the mean, reinforcing that the data spread is wide and not centered around a single typical value.
In summary, the univariate analysis reveals that each financial indicator is heavily skewed with a handful of outliers, so future steps will likely involve transformations, outlier-handling strategies, and possibly grouping countries by region or income for more context.
# --- Step 1: Extract the chosen year from each primary indicator ---
# (Here we use different years per file as in your example.)
df_fdi = primary_data["./data/API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv"][["Country Name", "Country Code", "2013"]].rename(columns={"2013": "FDI_2013"})
df_npl = primary_data["./data/API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv"][["Country Name", "Country Code", "2018"]].rename(columns={"2018": "NPL_2018"})
df_priv_bank = primary_data["./data/API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv"][["Country Name", "Country Code", "2013"]].rename(columns={"2013": "PrivCreditBanks_2013"})
df_broadmoney = primary_data["./data/API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv"][["Country Name", "Country Code", "2005"]].rename(columns={"2005": "BroadMoney_2005"})
df_dom_credit = primary_data["./data/API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv"][["Country Name", "Country Code", "2017"]].rename(columns={"2017": "DomCredit_FS_2017"})
df_priv_credit = primary_data["./data/API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv"][["Country Name", "Country Code", "2013"]].rename(columns={"2013": "PrivCredit_FS_2013"})
# --- Step 2: Drop duplicate Country Name columns ---
# We will keep the "Country Name" from df_fdi and drop it from the others.
dfs = [df_fdi, df_npl, df_priv_bank, df_broadmoney, df_dom_credit, df_priv_credit]
for i in range(1, len(dfs)):
if "Country Name" in dfs[i].columns:
dfs[i] = dfs[i].drop(columns=["Country Name"])
# --- Step 3: Merge the datasets on 'Country Code' ---
df_merged = dfs[0]
for i in range(1, len(dfs)):
df_merged = df_merged.merge(dfs[i], on="Country Code", how="inner")
# Optional: If you wish to inspect the merged DataFrame:
print("Merged DataFrame Head:")
print(df_merged.head())
# --- Step 4: Bivariate/Multivariate Analysis ---
indicator_cols = ["FDI_2013", "NPL_2018", "PrivCreditBanks_2013", "BroadMoney_2005", "DomCredit_FS_2017", "PrivCredit_FS_2013"]
# Compute and display the correlation matrix
corr_matrix = df_merged[indicator_cols].corr()
print("Correlation Matrix:")
print(corr_matrix)
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix of Financial Indicators")
plt.show()
# Create a pairplot to visualize pairwise relationships
sns.pairplot(df_merged[indicator_cols])
plt.suptitle("Pairplot of Financial Indicators", y=1.02)
plt.show()
Merged DataFrame Head:
Country Name Country Code FDI_2013 NPL_2018 \
0 Aruba ABW 9.974455 NaN
1 Africa Eastern and Southern AFE 2.342134 NaN
2 Afghanistan AFG 0.239801 8.886660
3 Africa Western and Central AFW 2.255601 NaN
4 Angola AGO -5.380131 23.221737
PrivCreditBanks_2013 BroadMoney_2005 DomCredit_FS_2017 \
0 59.163673 57.446790 NaN
1 38.386182 45.229569 NaN
2 4.108256 NaN NaN
3 12.576012 13.594007 NaN
4 21.341167 13.299819 29.796248
PrivCredit_FS_2013
0 59.163673
1 75.700968
2 4.321685
3 12.657947
4 NaN
Correlation Matrix:
FDI_2013 NPL_2018 PrivCreditBanks_2013 \
FDI_2013 1.000000 0.091327 0.305682
NPL_2018 0.091327 1.000000 -0.172562
PrivCreditBanks_2013 0.305682 -0.172562 1.000000
BroadMoney_2005 0.073034 -0.255719 0.751176
DomCredit_FS_2017 -0.188027 -0.231935 0.707298
PrivCredit_FS_2013 0.254977 -0.180352 0.931215
BroadMoney_2005 DomCredit_FS_2017 PrivCredit_FS_2013
FDI_2013 0.073034 -0.188027 0.254977
NPL_2018 -0.255719 -0.231935 -0.180352
PrivCreditBanks_2013 0.751176 0.707298 0.931215
BroadMoney_2005 1.000000 0.814145 0.743596
DomCredit_FS_2017 0.814145 1.000000 0.926731
PrivCredit_FS_2013 0.743596 0.926731 1.000000
C:\ProgramData\anaconda3\Lib\site-packages\seaborn\axisgrid.py:118: UserWarning: The figure layout has changed to tight self._figure.tight_layout(*args, **kwargs)
A correlation matrix across six key columns was computed:
FDI_2013: Foreign direct investment, net inflows (% of GDP) for 2013
NPL_2018: Bank nonperforming loans to total gross loans (%) for 2018
PrivCreditBanks_2013: Domestic credit to private sector by banks (% of GDP) for 2013
BroadMoney_2005: Broad money (% of GDP) for 2005
DomCredit_FS_2017: Domestic credit provided by financial sector (% of GDP) for 2017
PrivCredit_FS_2013: Domestic credit to private sector (% of GDP) for 2013
Key Observations
High Inter-Correlation Among Credit Indicators PrivCreditBanks_2013, DomCredit_FS_2017 and PrivCredit_FS_2013 all show strong positive correlations (above 0.70–0.90).
BroadMoney_2005 BroadMoney_2005 also correlates positively with these credit-based measures (0.74–0.81 range), indicating that economies with higher money supply also tend to have higher levels of domestic/private credit.
NPL_2018 Tends to Have Negative Correlation Nonperforming loans (%) in 2018 shows a moderate negative correlation with the credit indicators (ranging from -0.17 to -0.26). This might suggest that economies with higher NPL ratios do not necessarily sustain high levels of private or domestic credit. However, the correlation is not extremely strong, so further investigation is needed.
FDI_2013 Has Only Weak to Moderate Positive Correlations
FDI_2013’s correlations with the credit indicators are moderate (0.25–0.31) or near zero. This implies FDI inflows in 2013 may not strongly align with a country’s domestic credit or money supply levels.
Potential Outliers: Some countries exhibit extremely high values in certain indicators (e.g., domestic credit well above 200% of GDP), which appear in the correlation matrix and can be seen in boxplots. These outliers may need special handling (e.g., capping, transformation, or separate analysis).
From the pairplot (scatter plot matrix):
Many distributions (histograms along the diagonal) show right-skewed shapes. Certain indicators (e.g., domestic/private credit) have large outliers far to the right.
FDI also has some outliers in the higher ranges (hundreds of % of GDP for certain entries).
Pairwise Relationships
The scatter plots for the three credit measures confirm the high clustering and strong positive slope.
FDI vs. credit indicators show broad scatter, with no tight linear trend.
NPL vs. credit measures show a slightly downward trend in the scatter plots, consistent with the negative correlations.
# Define a function that extracts the time series for all years (i.e. numeric columns)
# for a set of sample countries and then plots them.
def plot_time_series_for_indicator(file_key, indicator_label, data_dict, countries):
"""
Plots a time series line chart for the chosen financial indicator.
Parameters:
file_key (str): key (file path) for the indicator's DataFrame in data_dict.
indicator_label (str): Label to use for the y-axis.
data_dict (dict): Dictionary containing the loaded DataFrames.
countries (list): List of country codes to include in the plot.
"""
df = data_dict[file_key]
# Identify year columns (assumed to be columns that consist only of digits)
year_cols = [col for col in df.columns if col.isdigit()]
if not year_cols:
print(f"No year columns found in {file_key}.")
return
# Filter rows for the sample countries based on 'Country Code'
df_sample = df[df["Country Code"].isin(countries)]
# We assume that "Country Name" is present; set it as the index.
# (If you prefer to use 'Country Code', adjust accordingly.)
df_sample = df_sample[["Country Name"] + year_cols].set_index("Country Name")
# Ensure the columns are sorted in chronological order.
sorted_years = sorted(year_cols, key=int)
df_sample = df_sample[sorted_years]
# Convert column names to integers for plotting purposes.
years = list(map(int, sorted_years))
plt.figure(figsize=(10, 6))
for country, row in df_sample.iterrows():
plt.plot(years, row, marker='o', label=country)
plt.xlabel("Year")
plt.ylabel(indicator_label)
plt.title(f"Time Series Analysis: {indicator_label}")
plt.legend()
plt.tight_layout()
plt.show()
# Define a sample list of countries by their "Country Code" (adjust as needed)
sample_countries = ["USA", "JPN", "IND", "CHN", "DEU"]
# Now, call the function for each primary indicator file.
# (Be sure that your 'primary_data' dictionary contains the DataFrames loaded previously.)
# 1. Foreign direct investment, net inflows (% of GDP)
plot_time_series_for_indicator(
file_key="./data/API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv",
indicator_label="FDI (% of GDP)",
data_dict=primary_data,
countries=sample_countries
)
# 2. Bank nonperforming loans to total gross loans (%)
plot_time_series_for_indicator(
file_key="./data/API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv",
indicator_label="Bank NPL Ratio (%)",
data_dict=primary_data,
countries=sample_countries
)
# 3. Domestic credit to private sector by banks (% of GDP)
plot_time_series_for_indicator(
file_key="./data/API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv",
indicator_label="Credit to Private Sector by Banks (% of GDP)",
data_dict=primary_data,
countries=sample_countries
)
# 4. Broad money (% of GDP)
plot_time_series_for_indicator(
file_key="./data/API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv",
indicator_label="Broad Money (% of GDP)",
data_dict=primary_data,
countries=sample_countries
)
# 5. Domestic credit provided by the financial sector (% of GDP)
plot_time_series_for_indicator(
file_key="./data/API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv",
indicator_label="Domestic Credit Provided by Financial Sector (% of GDP)",
data_dict=primary_data,
countries=sample_countries
)
# 6. Domestic credit to private sector (% of GDP) (from financial sector)
plot_time_series_for_indicator(
file_key="./data/API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv",
indicator_label="Domestic Credit to Private Sector (% of GDP)",
data_dict=primary_data,
countries=sample_countries
)
Trends Over Time: Time-series plots for each financial indicator (e.g., FDI, nonperforming loans, private sector credit, broad money, domestic credit) reveal that values change significantly over the years. Some indicators display low and relatively stable levels for long periods, while others show periods of rapid increase or decline.
Country Differences: When comparing a sample of countries, you can see that trends differ widely by country. For instance, certain economies experience sharp increases in FDI (% of GDP) and credit indicators in recent years, while others remain more volatile or exhibit flatter trends.
Outliers and Structural Breaks: The visualizations highlight outliers and abrupt shifts in the data. These anomalies may indicate significant economic events or differences in reporting standards that warrant further investigation.
Correlation Insights: Correlation analysis (via heatmaps and pairplots) between the indicators supports the visual findings, suggesting that some financial measures (e.g., domestic credit and private sector credit) are strongly interrelated, whereas others show weaker relationships.
# =============================================================================
# 1) LOAD PRIMARY INDICATOR DATA
# =============================================================================
# Extract and rename the chosen year column from each file:
df_fdi = (
primary_data["./data/API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv"]
[["Country Name", "Country Code", "2013"]]
.rename(columns={"2013": "FDI_2013"})
)
# For subsequent DataFrames, drop "Country Name" to avoid duplicates:
df_npl = (
primary_data["./data/API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv"]
[["Country Name", "Country Code", "2018"]]
.rename(columns={"2018": "NPL_2018"})
.drop(columns=["Country Name"])
)
df_priv_bank = (
primary_data["./data/API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv"]
[["Country Name", "Country Code", "2013"]]
.rename(columns={"2013": "PrivCreditBanks_2013"})
.drop(columns=["Country Name"])
)
df_broadmoney = (
primary_data["./data/API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv"]
[["Country Name", "Country Code", "2005"]]
.rename(columns={"2005": "BroadMoney_2005"})
.drop(columns=["Country Name"])
)
df_dom_credit = (
primary_data["./data/API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv"]
[["Country Name", "Country Code", "2017"]]
.rename(columns={"2017": "DomCredit_FS_2017"})
.drop(columns=["Country Name"])
)
df_priv_credit = (
primary_data["./data/API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv"]
[["Country Name", "Country Code", "2013"]]
.rename(columns={"2013": "PrivCredit_FS_2013"})
.drop(columns=["Country Name"])
)
# =============================================================================
# 2) MERGE PRIMARY INDICATOR DATA
# =============================================================================
df_primary = (
df_fdi
.merge(df_npl, on="Country Code", how="outer")
.merge(df_priv_bank, on="Country Code", how="outer")
.merge(df_broadmoney, on="Country Code", how="outer")
.merge(df_dom_credit, on="Country Code", how="outer")
.merge(df_priv_credit, on="Country Code", how="outer")
)
print("=== After merging primary indicators ===")
print(df_primary.head())
# =============================================================================
# 3) LOAD COUNTRY-LEVEL META DATA (Region & IncomeGroup)
# =============================================================================
# Pick a metadata file that includes "Region" and "IncomeGroup"
country_meta_file = "./data/Metadata_Country_API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv"
df_country_meta = pd.read_csv(country_meta_file, encoding="latin1")
# Clean column names to remove hidden/BOM characters
df_country_meta.columns = (
df_country_meta.columns
.str.strip()
.str.replace('"', '', regex=False) # remove literal double-quotes if present
.str.replace('', '', regex=False) # remove BOM-like sequences
)
print("\n=== Country meta columns ===")
print(df_country_meta.columns.tolist())
print("\n=== Sample of df_country_meta ===")
print(df_country_meta.head())
# Keep only the necessary columns
df_country_meta = df_country_meta[["Country Code", "Region", "IncomeGroup"]]
# =============================================================================
# 4) MERGE COUNTRY METADATA INTO PRIMARY DATA
# =============================================================================
df_merged = df_primary.merge(df_country_meta, on="Country Code", how="left")
print("\n=== Columns in df_merged ===")
print(df_merged.columns.tolist())
# =============================================================================
# 5) STRATIFIED ANALYSIS: Group-level Summary Statistics
# =============================================================================
indicator_cols = [
"FDI_2013",
"NPL_2018",
"PrivCreditBanks_2013",
"BroadMoney_2005",
"DomCredit_FS_2017",
"PrivCredit_FS_2013"
]
# (A) Compute means by Region
region_means = df_merged.groupby("Region")[indicator_cols].mean(numeric_only=True)
print("\nMean values by Region:\n", region_means, "\n")
# (B) Compute means by Income Group
income_means = df_merged.groupby("IncomeGroup")[indicator_cols].mean(numeric_only=True)
print("Mean values by Income Group:\n", income_means, "\n")
# ============================================================================
# 6) VISUALIZATION: Boxplots for All Indicators by Region & IncomeGroup
# =============================================================================
# Optionally, create a dict for descriptive labels
indicator_labels = {
"FDI_2013": "FDI (% of GDP)",
"NPL_2018": "Bank Nonperforming Loans (%)",
"PrivCreditBanks_2013": "Credit to Private Sector by Banks (% of GDP)",
"BroadMoney_2005": "Broad Money (% of GDP)",
"DomCredit_FS_2017": "Domestic Credit by Financial Sector (% of GDP)",
"PrivCredit_FS_2013": "Domestic Credit to Private Sector (% of GDP)"
}
# Loop over each indicator and produce two boxplots (Region & IncomeGroup)
for col in indicator_cols:
label = indicator_labels.get(col, col) # fallback to column name if not in dict
# --- Boxplot by Region ---
plt.figure(figsize=(8, 6))
sns.boxplot(data=df_merged, x="Region", y=col)
plt.title(f"{label} by Region")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# --- Boxplot by Income Group ---
plt.figure(figsize=(8, 6))
sns.boxplot(data=df_merged, x="IncomeGroup", y=col)
plt.title(f"{label} by Income Group")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
=== After merging primary indicators ===
Country Name Country Code FDI_2013 NPL_2018 \
0 Aruba ABW 9.974455 NaN
1 Africa Eastern and Southern AFE 2.342134 NaN
2 Afghanistan AFG 0.239801 8.886660
3 Africa Western and Central AFW 2.255601 NaN
4 Angola AGO -5.380131 23.221737
PrivCreditBanks_2013 BroadMoney_2005 DomCredit_FS_2017 \
0 59.163673 57.446790 NaN
1 38.386182 45.229569 NaN
2 4.108256 NaN NaN
3 12.576012 13.594007 NaN
4 21.341167 13.299819 29.796248
PrivCredit_FS_2013
0 59.163673
1 75.700968
2 4.321685
3 12.657947
4 NaN
=== Country meta columns ===
['Country Code', 'Region', 'IncomeGroup', 'SpecialNotes', 'TableName', 'Unnamed: 5']
=== Sample of df_country_meta ===
Country Code Region IncomeGroup \
0 ABW Latin America & Caribbean High income
1 AFE NaN NaN
2 AFG South Asia Low income
3 AFW NaN NaN
4 AGO Sub-Saharan Africa Lower middle income
SpecialNotes \
0 NaN
1 26 countries, stretching from the Red Sea in t...
2 The reporting period for national accounts dat...
3 22 countries, stretching from the westernmost ...
4 The World Bank systematically assesses the app...
TableName Unnamed: 5
0 Aruba NaN
1 Africa Eastern and Southern NaN
2 Afghanistan NaN
3 Africa Western and Central NaN
4 Angola NaN
=== Columns in df_merged ===
['Country Name', 'Country Code', 'FDI_2013', 'NPL_2018', 'PrivCreditBanks_2013', 'BroadMoney_2005', 'DomCredit_FS_2017', 'PrivCredit_FS_2013', 'Region', 'IncomeGroup']
Mean values by Region:
FDI_2013 NPL_2018 PrivCreditBanks_2013 \
Region
East Asia & Pacific 6.306956 2.861994 72.458370
Europe & Central Asia 7.501493 8.198984 74.451740
Latin America & Caribbean 16.992717 5.286834 44.026847
Middle East & North Africa 2.864386 7.369262 45.253572
North America 2.061941 0.709188 49.080475
South Asia 2.194278 7.473174 33.389755
Sub-Saharan Africa 5.929472 11.718951 19.986114
BroadMoney_2005 DomCredit_FS_2017 \
Region
East Asia & Pacific 77.247698 109.845529
Europe & Central Asia 44.236042 56.503162
Latin America & Caribbean 53.772776 63.941754
Middle East & North Africa 68.650318 105.744479
North America 110.500548 244.294701
South Asia 49.634166 65.257993
Sub-Saharan Africa 27.357474 63.339612
PrivCredit_FS_2013
Region
East Asia & Pacific 76.772835
Europe & Central Asia 77.628225
Latin America & Caribbean 45.846122
Middle East & North Africa 46.711519
North America 184.162922
South Asia 33.793817
Sub-Saharan Africa 21.410133
Mean values by Income Group:
FDI_2013 NPL_2018 PrivCreditBanks_2013 \
IncomeGroup
High income 13.662423 5.699639 83.731705
Low income 8.620511 9.408371 12.095209
Lower middle income 3.673075 9.213829 30.166656
Upper middle income 4.567154 7.524223 45.289449
BroadMoney_2005 DomCredit_FS_2017 PrivCredit_FS_2013
IncomeGroup
High income 79.631106 129.428388 87.893176
Low income 25.154259 21.638508 13.066887
Lower middle income 43.579961 58.878699 32.459924
Upper middle income 46.519551 65.311052 49.470033
After merging the primary financial indicators—FDI, bank nonperforming loans, credit to the private sector by banks, broad money, domestic credit by the financial sector, and domestic credit to the private sector—with country-level metadata (Region and Income Group), we computed group-level summary statistics to assess differences across regions and income groups. We then visualized these differences via boxplots for each indicator.
FDI (% of GDP):
The mean FDI value is notably higher in Latin America & Caribbean (about 17%) compared to East Asia & Pacific (around 6%). Sub-Saharan Africa also exhibits relatively high variation, with some outliers reaching above 40%. These patterns are visible in the boxplot of FDI by region, where outliers highlight the diversity of FDI inflows within and across regions.
Bank Nonperforming Loans (%):
Sub-Saharan Africa stands out with a higher mean NPL ratio (above 11%), whereas North America shows a much lower average (below 1%). The boxplot by region confirms that some African countries experience significantly elevated NPL ratios, potentially reflecting heightened credit risk.
Credit to Private Sector by Banks (% of GDP):
East Asia & Pacific and Europe & Central Asia both show higher average bank credit to the private sector (over 70%), whereas Sub-Saharan Africa remains comparatively low (around 20%). The boxplots also highlight a few outliers—e.g., certain high-performer countries in East Asia or Europe that exceed 100%.
Broad Money (% of GDP):
North America again ranks high (over 110%), while East Asia & Pacific also exhibits a wide range with some countries surpassing 150%. Regions such as South Asia and Sub-Saharan Africa display lower median values but still have occasional outliers that exceed the group norm.
Domestic Credit by the Financial Sector (% of GDP):
Although not every region has complete data, North America reports the highest mean (over 240%), with East Asia & Pacific also showing substantial variation. Some outliers in East Asia approach or exceed 300%, illustrating how certain financial systems in that region provide significant domestic credit.
Domestic Credit to Private Sector (% of GDP):
East Asia & Pacific and Europe & Central Asia tend to have higher averages, often above 70–80%, while Sub-Saharan Africa remains lower (around 21%). The boxplot by region reveals substantial dispersion, with a few countries in East Asia far exceeding 100%.
Overall, these regional comparisons suggest that economic structures, regulatory environments, and financial development levels vary considerably by geography, influencing how much credit is extended, how many loans become nonperforming, and how attractive FDI inflows can be.
High-Income vs. Low-Income Patterns:
FDI: High-income countries average about 13.7%, whereas low-income countries show around 8.6%.
Credit to Private Sector by Banks: High-income countries reach nearly 84%, but low-income countries sit around 12%.
Broad Money: High-income economies surpass 79% on average, while low-income economies remain around 25%.
Domestic Credit (Financial Sector & Private Sector): Again, high-income groups lead with significantly higher means, illustrating more developed financial systems.
Middle-Income Groups:
Lower Middle Income typically ranges between the low- and high-income extremes. For example, credit to private sector by banks is about 30%—well above the low-income figure yet still far below high-income countries.
Upper Middle Income is often closer to high-income levels (e.g., 45% for credit to private sector by banks), though still showing gaps in financial depth compared to the highest-income category.
Boxplots for each indicator by income group confirm these patterns: high-income countries consistently exhibit higher medians (and occasional outliers well above 100–200%), while low-income nations cluster at lower levels, sometimes with a few outliers that deviate from the group norm.
The stratified analysis reveals significant differences in financial indicators when segmented by both region and income group:
Regional Variation:
Geographic factors, historical development, and local economic policies can drive substantial differences in FDI inflows, bank lending practices, and the overall depth of financial intermediation.
Income-Level Disparities:
High-income economies generally have more advanced financial sectors—e.g., higher credit to the private sector and broad money relative to GDP—indicating deeper financial markets. Meanwhile, low-income countries face challenges like higher NPL ratios or lower domestic credit availability.
These findings underscore how macroeconomic context and development levels may influence financial performance and stability. Future investigations might focus on the specific drivers behind these differences—such as institutional quality, openness to trade, or regulatory frameworks—and how they impact the observed outliers. Ultimately, tailoring policies or investment strategies to align with regional and income-level nuances can help stakeholders navigate the varying financial landscapes worldwide.
Financial indicators like FDI (Foreign Direct Investment) and credit to the private sector vary widely between regions and income levels.
High-income countries generally have more developed financial systems, meaning they have higher credit-to-GDP ratios and a larger money supply.
In contrast, low-income countries usually have lower levels of these financial indicators.
Latin America & Caribbean shows higher average FDI, but some East Asian and Pacific countries also receive significant FDI inflows.
Sub-Saharan Africa tends to have higher levels of nonperforming loans (NPLs) and lower domestic credit.
North America has fewer data points, but consistently shows strong financial indicators, such as broad money supply and private sector credit, suggesting a mature financial system.
High-income countries generally attract more FDI, have more private sector credit, and hold a larger portion of GDP as broad money.
Low-income countries often have lower financial indicators, with a few exceptions where some countries attract higher FDI.
Middle-income countries (both lower and upper) fall somewhere in between—better than low-income but not as advanced as high-income countries in terms of financial development.
In some countries, there is a steady increase in credit or broad money over the years, while others show more fluctuation, especially in FDI.
Certain global events (like economic recessions) may appear as drops in FDI or spikes in NPL ratios.
Some country-year combinations lack data, especially from earlier years or specific indicators. This could lead to misleading averages if not properly managed.
Certain regions, like North America, have limited data points, making it hard to get a complete picture of the region.
Some data points stand out, like FDI exceeding 400% for a few countries, or credit metrics going above 200-300%. These might indicate unique economic situations or errors that need checking.
Some codes in the dataset represent groups of countries rather than single countries (e.g., "AFE" for Africa Eastern and Southern). We need to decide whether to include or exclude these when analyzing individual countries.
Some countries use multiple exchange rates or report data based on different calendars (fiscal vs. calendar year), which could affect comparisons over time.
Sub-Saharan Africa, in particular, has notes about using different exchange rates, which might impact the accuracy of FDI and credit data.
Choose a method for handling missing data, like filling gaps, removing incomplete rows, or using averages.
Review outliers to see if they are valid economic data or just mistakes. Consider using techniques like winsorization to reduce the effect of extreme values.
Check whether grouped country codes (like "AFE") should be treated separately or excluded.
Adjust for countries that use fiscal-year reporting by making sure the data aligns with other countries using calendar years.
Look for patterns between indicators, like how credit to the private sector might relate to NPL ratios.
Explore time-based trends to see if any countries have sudden changes in FDI or NPLs that might need further investigation.
Think about using weighted averages for global or regional comparisons, possibly using GDP or population size as weights.
Based on the comprehensive Data Understanding (which includes exploring data sources, assessing data quality, and performing initial EDA), the following tasks to prepare the data for modeling and further analysis.
In this step, we will focus on assessing and addressing missing data in our analysis. We will work primarily with the merged dataset (df_merged), which combines the six primary financial indicator files and the country-level metadata file. These source files are:
Primary Indicator Files:
API_BX.KLT.DINV.WD.GD.ZS_DS2_en_csv_v2_76268.csv (Foreign Direct Investment – FDI)
API_FB.AST.NPER.ZS_DS2_en_csv_v2_76286.csv (Bank Nonperforming Loans – NPL)
API_FD.AST.PRVT.GD.ZS_DS2_en_csv_v2_14604.csv (Credit to Private Sector by Banks)
API_FM.LBL.BMNY.GD.ZS_DS2_en_csv_v2_14609.csv (Broad Money)
API_FS.AST.DOMS.GD.ZS_DS2_en_csv_v2_13269.csv (Domestic Credit by the Financial Sector)
API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv (Domestic Credit to Private Sector)
Country Metadata File:
Metadata_Country_API_FS.AST.PRVT.GD.ZS_DS2_en_csv_v2_75956.csv (which contains information on Region and IncomeGroup)All the primary indicator data have been merged into a single DataFrame (df_primary), and then the country-level metadata were merged to form df_merged. This merged DataFrame is our main working dataset for data cleaning, exploration, and later modeling.
# Check missing values for the merged data
missing_summary = df_merged.isnull().sum()
print("Missing values in df_merged:\n", missing_summary)
Missing values in df_merged: Country Name 0 Country Code 0 FDI_2013 31 NPL_2018 119 PrivCreditBanks_2013 37 BroadMoney_2005 61 DomCredit_FS_2017 213 PrivCredit_FS_2013 43 Region 49 IncomeGroup 50 dtype: int64
# Visualize missing data using a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df_merged.isnull(), cbar=False, cmap="viridis")
plt.title("Missing Data Heatmap for Merged Dataset")
plt.show()
From the counts, we see the following missing values in a dataset of (likely) 266 rows:
These percentages can help guide whether to drop or impute columns or rows. For instance, DomCredit_FS_2017 is missing data for around 80% of the rows, which is very high;
A. DomCredit_FS_2017 (80% missing) Because 4 out of 5 rows have no data for this indicator, any statistical analysis including this column would be severely limited.
B. NPL_2018 (45% missing) Missingness is substantial but not as extreme as DomCredit_FS_2017. If bank nonperforming loans are key to the analysis, it is possible:
Attempt imputation for moderate levels of missing data (e.g., median-based if the distribution is skewed).
Restrict analyses involving NPL to the subset of countries with valid data, dropping those rows for NPL-related tasks on
C. BroadMoney_2005 (23% missing), FDI_2013 (12%), PrivCreditBanks_2013 (14%), PrivCredit_FS_2013 (16%) These columns have missingness levels between 10% and 25%, which is still manageable.
Imputation or row-dropping can be considered on a case-by-case basis:
If a relatively small fraction of your final analytical sample is acceptable to lose, row-wise deletion might be simpler.
If preserving sample size is paramount, consider imputing numeric fields with robust statistics (e.g., median or possibly multiple imputation if you need more sophisticated methods).
D. Region (18% missing) and IncomeGroup (19% missing) These are categorical. For 18–19% missing, it is possible to:
Assign a placeholder category, e.g., "Unknown" or "Not Specified" to avoid losing data for those countries.
If these categories are crucial for stratified analysis, dropping 18–19% of the dataset might be too large a sacrifice. A placeholder label is usually safer.
Key Business Questions:
Implication:
FDI_2013, BroadMoney_2005, PrivCreditBanks_2013, PrivCredit_FS_2013 (~10–25% missing)
Important for Q1–Q4.
Decision: Impute with median or advanced methods; or drop rows with missing values if the sample remains acceptable.
"Unknown" to missing categories to avoid losing entire rows.threshold_missing = 0.80 * len(df_merged) # 80% threshold
if df_merged['DomCredit_FS_2017'].isnull().sum() > threshold_missing:
df_merged.drop(columns=['DomCredit_FS_2017'], inplace=True)
print("Dropped 'DomCredit_FS_2017' due to excessive missingness.")
else:
print("'DomCredit_FS_2017' retained; consider advanced imputation.")
Dropped 'DomCredit_FS_2017' due to excessive missingness.
npl_missing_ratio = df_merged['NPL_2018'].isnull().mean()
print(f"NPL_2018 missing ratio: {npl_missing_ratio:.2%}")
retain_npl = True # or False, depending on your focus
if retain_npl:
print("Retaining NPL_2018 for analysis. Will consider imputation or partial data usage.")
else:
df_merged.drop(columns=['NPL_2018'], inplace=True)
print("Dropped 'NPL_2018' due to high missingness and lesser relevance.")
NPL_2018 missing ratio: 44.74% Retaining NPL_2018 for analysis. Will consider imputation or partial data usage.
indicators_to_impute = ['FDI_2013', 'BroadMoney_2005', 'PrivCreditBanks_2013', 'PrivCredit_FS_2013']
for col in indicators_to_impute:
missing_ratio = df_merged[col].isnull().mean()
if missing_ratio > 0:
median_val = df_merged[col].median(skipna=True)
df_merged[col].fillna(median_val, inplace=True)
print(f"Imputed missing values in {col} with median = {median_val:.2f}. Missing ratio was {missing_ratio:.2%}")
else:
print(f"No missing values in {col}; no imputation needed.")
Imputed missing values in FDI_2013 with median = 2.88. Missing ratio was 11.65% Imputed missing values in BroadMoney_2005 with median = 44.52. Missing ratio was 22.93% Imputed missing values in PrivCreditBanks_2013 with median = 41.08. Missing ratio was 13.91% Imputed missing values in PrivCredit_FS_2013 with median = 42.77. Missing ratio was 16.17%
for cat_col in ['Region', 'IncomeGroup']:
missing_ratio = df_merged[cat_col].isnull().mean()
if missing_ratio > 0:
df_merged[cat_col].fillna("Unknown", inplace=True)
print(f"Assigned 'Unknown' to missing values in {cat_col} ({missing_ratio:.2%} missing).")
else:
print(f"No missing values in {cat_col}.")
Assigned 'Unknown' to missing values in Region (18.42% missing). Assigned 'Unknown' to missing values in IncomeGroup (18.80% missing).
After analyzing the missing values in our merged dataset, we implemented the following steps based on our key business questions (financial indicators, regional/income-group comparisons, policy insights, etc.):
DomCredit_FS_2017¶Reason: Over 80% of its rows were missing (213 out of 266). This excessive missingness severely limited its usefulness for analysis or modeling.
Decision: Remove the column to maintain a cleaner dataset rather than attempting to impute such a large gap.
NPL_2018 (Nonperforming Loans Ratio)¶Reason: Although NPL_2018 had a missing ratio of ~45%, it is crucial for understanding financial stability (one of our key questions).
Decision: Keep it for policy/stability insights, with the possibility of either imputation or restricting some analyses to countries where this data is available.
Indicators: FDI_2013, BroadMoney_2005, PrivCreditBanks_2013, PrivCredit_FS_2013
Missing Ratios: Ranged from ~10–25%.
Method: Median imputation.
Rationale:
These are central to measuring economic performance, cross-country financial development, and policy outcomes.
We chose median imputation to reduce the impact of extreme outliers.
Result: Missing data in these columns is now filled with a robust central tendency measure.
"Unknown" for Missing Region & IncomeGroup¶Reason: Both had missingness around 18–19%.
Decision: Instead of dropping these rows, we set them to "Unknown" to preserve other indicator data for those observations. This ensures we don’t lose entire rows that might still have valuable numeric indicators for other analyses.
Indicators Influencing Economic Performance: We kept and imputed essential columns (FDI, credit measures) to avoid losing critical data.
Differences Across Regions & Income Groups: We preserved rows with missing categorical data by labeling them "Unknown", maintaining maximum coverage for comparisons.
Predictive Analysis: Retaining as many relevant indicators as possible (with median imputation) supports future modeling.
Policy Measures for Stability: We specifically kept NPL_2018 to examine financial-stability issues, despite high missingness.
Duplicate entries can occur after merging multiple datasets, especially if the same Country Code appears more than once with identical or near-identical rows.
duplicate_count = df_merged.duplicated(subset="Country Code", keep=False).sum()
print(f"Number of potential duplicate rows: {duplicate_count}")
# Optionally display the duplicated rows for inspection
if duplicate_count > 0:
duplicates_df = df_merged[df_merged.duplicated(subset="Country Code", keep=False)]
display(duplicates_df)
Number of potential duplicate rows: 0
The result indicates that there are no duplicate records in our merged dataset.
To ensure that our financial indicators are suitable for analysis, we need to verify that all numeric indicator columns are stored as numeric data types (e.g., float64). This is important for accurate calculations, statistical analysis, and visualizations.
# --- Step 1: Review Data Types Before Conversion ---
print("Data types before conversion:")
print(df_merged.dtypes)
# --- Step 2: Convert Financial Indicator Columns to Numeric ---
# List of columns that should be numeric
numeric_cols = [
"FDI_2013", "NPL_2018", "PrivCreditBanks_2013",
"BroadMoney_2005", "PrivCredit_FS_2013"
]
# Convert each column to numeric (if not already), coercing errors to NaN
for col in numeric_cols:
df_merged[col] = pd.to_numeric(df_merged[col], errors='coerce')
# --- Step 3: Verify Data Types After Conversion ---
print("\nData types after conversion:")
print(df_merged.dtypes)
Data types before conversion: Country Name object Country Code object FDI_2013 float64 NPL_2018 float64 PrivCreditBanks_2013 float64 BroadMoney_2005 float64 PrivCredit_FS_2013 float64 Region object IncomeGroup object dtype: object Data types after conversion: Country Name object Country Code object FDI_2013 float64 NPL_2018 float64 PrivCreditBanks_2013 float64 BroadMoney_2005 float64 PrivCredit_FS_2013 float64 Region object IncomeGroup object dtype: object
Before Conversion:
FDI_2013, NPL_2018, PrivCreditBanks_2013, BroadMoney_2005, and PrivCredit_FS_2013 were already of type float64.After Conversion:
float64), confirming that all our numerical indicators are properly formatted for further analysis.Conclusion:
float64, we can now proceed with robust quantitative analyses and visualizations, confident that our calculations will not be affected by data type inconsistencies.We verified that the categorical fields in our merged dataset are correctly formatted as string (object) types. The key categorical fields are:
print(df_merged[['Country Name', 'Country Code', 'Region', 'IncomeGroup']].dtypes)
Country Name object Country Code object Region object IncomeGroup object dtype: object
This output confirms that all our categorical fields are correctly stored as objects. This ensures that our stratified analyses—such as grouping by Region or IncomeGroup—will work as expected.
Note: If desired for memory efficiency or faster processing during categorical operations, these columns can be converted to the category dtype. For our current analysis, however, the object type is sufficient.
Conclusion: The categorical fields are properly formatted as strings, ensuring that our analyses involving group-level summary statistics and visualizations by region and income group can proceed without data type issues.
Ensuring that the country codes are consistent across our dataset is critical because they serve as the unique identifier for merging and subsequent analyses. Inconsistent formatting (e.g., mixed case, extra whitespace) can lead to issues when joining datasets or grouping by country.
# Standardize 'Country Code' in the merged DataFrame
df_merged["Country Code"] = df_merged["Country Code"].str.upper().str.strip()
# Verify the standardized country codes
print("Unique Country Codes:")
print(df_merged["Country Code"].unique())
Unique Country Codes: ['ABW' 'AFE' 'AFG' 'AFW' 'AGO' 'ALB' 'AND' 'ARB' 'ARE' 'ARG' 'ARM' 'ASM' 'ATG' 'AUS' 'AUT' 'AZE' 'BDI' 'BEL' 'BEN' 'BFA' 'BGD' 'BGR' 'BHR' 'BHS' 'BIH' 'BLR' 'BLZ' 'BMU' 'BOL' 'BRA' 'BRB' 'BRN' 'BTN' 'BWA' 'CAF' 'CAN' 'CEB' 'CHE' 'CHI' 'CHL' 'CHN' 'CIV' 'CMR' 'COD' 'COG' 'COL' 'COM' 'CPV' 'CRI' 'CSS' 'CUB' 'CUW' 'CYM' 'CYP' 'CZE' 'DEU' 'DJI' 'DMA' 'DNK' 'DOM' 'DZA' 'EAP' 'EAR' 'EAS' 'ECA' 'ECS' 'ECU' 'EGY' 'EMU' 'ERI' 'ESP' 'EST' 'ETH' 'EUU' 'FCS' 'FIN' 'FJI' 'FRA' 'FRO' 'FSM' 'GAB' 'GBR' 'GEO' 'GHA' 'GIB' 'GIN' 'GMB' 'GNB' 'GNQ' 'GRC' 'GRD' 'GRL' 'GTM' 'GUM' 'GUY' 'HIC' 'HKG' 'HND' 'HPC' 'HRV' 'HTI' 'HUN' 'IBD' 'IBT' 'IDA' 'IDB' 'IDN' 'IDX' 'IMN' 'IND' 'INX' 'IRL' 'IRN' 'IRQ' 'ISL' 'ISR' 'ITA' 'JAM' 'JOR' 'JPN' 'KAZ' 'KEN' 'KGZ' 'KHM' 'KIR' 'KNA' 'KOR' 'KWT' 'LAC' 'LAO' 'LBN' 'LBR' 'LBY' 'LCA' 'LCN' 'LDC' 'LIC' 'LIE' 'LKA' 'LMC' 'LMY' 'LSO' 'LTE' 'LTU' 'LUX' 'LVA' 'MAC' 'MAF' 'MAR' 'MCO' 'MDA' 'MDG' 'MDV' 'MEA' 'MEX' 'MHL' 'MIC' 'MKD' 'MLI' 'MLT' 'MMR' 'MNA' 'MNE' 'MNG' 'MNP' 'MOZ' 'MRT' 'MUS' 'MWI' 'MYS' 'NAC' 'NAM' 'NCL' 'NER' 'NGA' 'NIC' 'NLD' 'NOR' 'NPL' 'NRU' 'NZL' 'OED' 'OMN' 'OSS' 'PAK' 'PAN' 'PER' 'PHL' 'PLW' 'PNG' 'POL' 'PRE' 'PRI' 'PRK' 'PRT' 'PRY' 'PSE' 'PSS' 'PST' 'PYF' 'QAT' 'ROU' 'RUS' 'RWA' 'SAS' 'SAU' 'SDN' 'SEN' 'SGP' 'SLB' 'SLE' 'SLV' 'SMR' 'SOM' 'SRB' 'SSA' 'SSD' 'SSF' 'SST' 'STP' 'SUR' 'SVK' 'SVN' 'SWE' 'SWZ' 'SXM' 'SYC' 'SYR' 'TCA' 'TCD' 'TEA' 'TEC' 'TGO' 'THA' 'TJK' 'TKM' 'TLA' 'TLS' 'TMN' 'TON' 'TSA' 'TSS' 'TTO' 'TUN' 'TUR' 'TUV' 'TZA' 'UGA' 'UKR' 'UMC' 'URY' 'USA' 'UZB' 'VCT' 'VEN' 'VGB' 'VIR' 'VNM' 'VUT' 'WLD' 'WSM' 'XKX' 'YEM' 'ZAF' 'ZMB' 'ZWE']
### 3.3 Summury of Handling Missing Data
After a thorough assessment of missing values in our merged dataset, we reached the following conclusions and made key decisions aligned with our business questions:
Overall Missingness:
FDI_2013: 31 missing values (~11.65% of rows)
NPL_2018: 119 missing values (~44.74% of rows)
PrivCreditBanks_2013: 37 missing values (~13.91%)
BroadMoney_2005: 61 missing values (~22.93%)
DomCredit_FS_2017: 213 missing values (over 80% missing)
PrivCredit_FS_2013: 43 missing values (~16.17%)
Region: 49 missing values (~18.42%)
IncomeGroup: 50 missing values (~18.80%)
Indicator Criticality:
Indicators such as FDI, credit measures, and broad money are central to our key business questions on economic performance, cross-country comparisons, and predictive modeling.
Categorical fields (Region and IncomeGroup) are essential for stratifying the analysis and deriving policy insights.
#### Decisions and Actions
Drop Extremely Incomplete Columns:
Retain and Handle Moderately Missing Numeric Indicators:
Handle Missing Categorical Values:
"Unknown". This preserves the observations for which the numeric data are otherwise available, ensuring robust group comparisons.Drop:
DomCredit_FS_2017 is dropped due to its excessive missingness (>80%).Retain (with Imputation):
NPL_2018: Retained for its importance in assessing financial stability. Further analysis may involve imputation or subsetting.FDI_2013, BroadMoney_2005, PrivCreditBanks_2013, and PrivCredit_FS_2013: Impute missing values with the median (robust to outliers).Categorical Fields:
Region and IncomeGroup, assign "Unknown" to missing entries rather than dropping rows. This ensures maximal data retention for cross-group analyses.This process ensures our dataset is now cleaner and more consistent, allowing for robust quantitative analyses and accurate comparisons across regions and income groups.
This section outlines the approach to identifying and handling outliers in our financial indicators. The goal is to ensure that extreme values do not distort our analyses, while preserving genuine economic signals that might be important for policy insights and predictive modeling.
# List of the indicator columns to analyze (adjust as needed)
indicator_cols = [
"FDI_2013",
"NPL_2018",
"PrivCreditBanks_2013",
"BroadMoney_2005",
"PrivCredit_FS_2013"
]
# Loop over each indicator column to detect outliers and visualize
for col in indicator_cols:
print(f"\n--- Analysis for {col} ---")
# --------------------------
# IQR Method
# --------------------------
Q1 = df_merged[col].quantile(0.25)
Q3 = df_merged[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Identify outliers using the IQR method
outliers_iqr = df_merged[(df_merged[col] < lower_bound) | (df_merged[col] > upper_bound)]
num_outliers_iqr = outliers_iqr.shape[0]
print(f"IQR method: {num_outliers_iqr} outliers detected")
print(f"Lower bound: {lower_bound:.2f}, Upper bound: {upper_bound:.2f}")
# --------------------------
# Z-score Method
# --------------------------
# Compute z-scores on non-null values
z_scores = zscore(df_merged[col].dropna())
# Identify values with |z| > 3
outliers_z = df_merged[col].dropna()[np.abs(z_scores) > 3]
num_outliers_z = outliers_z.shape[0]
print(f"Z-score method: {num_outliers_z} outliers detected")
# --------------------------
# Visualizations
# --------------------------
plt.figure(figsize=(12, 5))
# Boxplot
plt.subplot(1, 2, 1)
sns.boxplot(x=df_merged[col])
plt.title(f"Boxplot of {col}")
plt.xlabel(col)
# Histogram with KDE
plt.subplot(1, 2, 2)
sns.histplot(df_merged[col], bins=30, kde=True)
plt.title(f"Histogram of {col}")
plt.xlabel(col)
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
--- Analysis for FDI_2013 --- IQR method: 32 outliers detected Lower bound: -2.72, Upper bound: 8.92 Z-score method: 2 outliers detected
--- Analysis for NPL_2018 --- IQR method: 10 outliers detected Lower bound: -8.81, Upper bound: 20.65 Z-score method: 4 outliers detected
--- Analysis for PrivCreditBanks_2013 --- IQR method: 18 outliers detected Lower bound: -30.21, Upper bound: 115.80 Z-score method: 4 outliers detected
--- Analysis for BroadMoney_2005 --- IQR method: 26 outliers detected Lower bound: -8.59, Upper bound: 96.46 Z-score method: 4 outliers detected
--- Analysis for PrivCredit_FS_2013 --- IQR method: 19 outliers detected Lower bound: -32.54, Upper bound: 125.39 Z-score method: 5 outliers detected
Below is a summary of outlier detection using two methods: the Interquartile Range (IQR) method and the Z-score method. We applied these methods to our core numeric indicators:
FDI_2013
NPL_2018
PrivCreditBanks_2013
BroadMoney_2005
PrivCredit_FS_2013
IQR Method
Lower Bound: -2.72
Upper Bound: 8.92
Number of Outliers: 32
Z-score Method
Threshold: |Z| > 3
Number of Outliers: 2
IQR Method
Lower Bound: -8.81
Upper Bound: 20.65
Number of Outliers: 10
Z-score Method
Threshold: |Z| > 3
Number of Outliers: 4
IQR Method
Lower Bound: -30.21
Upper Bound: 115.80
Number of Outliers: 18
Z-score Method
Threshold: |Z| > 3
Number of Outliers: 4
IQR Method
Lower Bound: -8.59
Upper Bound: 96.46
Number of Outliers: 26
Z-score Method
Threshold: |Z| > 3
Number of Outliers: 4
IQR Method
Lower Bound: -32.54
Upper Bound: 125.39
Number of Outliers: 19
Z-score Method
Threshold: |Z| > 3
Number of Outliers: 5
IQR vs. Z-score Differences
Potential Real-Economy Phenomena
By examining outliers with both IQR and Z-score methods—and through boxplots/histograms—we have a clearer view of how extreme values might impact our analysis. In the next step, we will decide which outlier treatment (if any) aligns best with our key business questions and modeling needs.
For some countries, very high FDI relative to GDP is expected. For instance, smaller economies (e.g., Afghanistan) may experience one-time surges in FDI, as noted in the metadata where fiscal-year reporting and alternative conversion factors are mentioned.
Similarly, high variability in exchange rates (as seen for China and Egypt) could lead to unusually high or low FDI and credit metrics. These notes suggest that extreme values might be driven by genuine economic events rather than data errors.
Several entries mention the use of alternative conversion factors and adjustments (e.g., for Albania, Belarus, and Egypt). This indicates that the reported financial indicators may already incorporate necessary adjustments, supporting their validity even when they appear as outliers.
The metadata for countries such as those in the Arab World or Sub-Saharan Africa sometimes report aggregated values (e.g., "26 countries" or regional aggregates). These should be considered carefully during analysis as they may not be directly comparable with data for individual nations.
Retaining Outliers: If an extreme value (for example, a very high FDI percentage) corresponds with a note indicating a one-time surge or a unique economic situation, it should be retained in the analysis because it reflects an important phenomenon.
Contextual Interpretation: Outliers in credit indicators might also be valid if they are driven by specific economic circumstances (e.g., multiple exchange rate systems or fiscal reporting differences).
In this step, we aim to identify and address outliers that are likely the result of data-entry errors (for example, a misplaced decimal point or an incorrect country code).
# Example for FDI_2013:
q1 = df_merged['FDI_2013'].quantile(0.25)
q3 = df_merged['FDI_2013'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
potential_errors = df_merged[(df_merged['FDI_2013'] < lower_bound) | (df_merged['FDI_2013'] > upper_bound)]
display(potential_errors[['Country Code', 'FDI_2013']])
| Country Code | FDI_2013 | |
|---|---|---|
| 0 | ABW | 9.974455 |
| 4 | AGO | -5.380131 |
| 5 | ALB | 9.816285 |
| 17 | BEL | -5.655618 |
| 22 | BHR | 11.021131 |
| 37 | CHE | -3.529486 |
| 44 | COG | 10.462101 |
| 52 | CYM | 432.859400 |
| 53 | CYP | 108.374905 |
| 56 | DJI | 14.000493 |
| 90 | GRD | 15.472927 |
| 96 | HKG | 27.877520 |
| 101 | HUN | -2.749089 |
| 111 | IRL | 29.814988 |
| 123 | KHM | 10.443059 |
| 125 | KNA | 16.487055 |
| 131 | LBR | 62.906854 |
| 137 | LIE | 18.068983 |
| 144 | LUX | 71.500434 |
| 152 | MDV | 10.950389 |
| 155 | MHL | 17.461454 |
| 162 | MNE | 9.997307 |
| 163 | MNG | 16.370076 |
| 165 | MOZ | 38.942865 |
| 166 | MRT | 15.589003 |
| 172 | NCL | 21.344995 |
| 176 | NLD | 35.131467 |
| 183 | OSS | 23.219703 |
| 208 | SGP | 20.934481 |
| 218 | SST | 19.168528 |
| 242 | TTO | -3.935812 |
| 253 | VCT | 20.077683 |
Most values for FDI_2013 fall within a plausible range (approximately -10% to 30%). However, the record for CYM (Cayman Islands) shows an exceptionally high value of 432.86, which is far beyond the typical range.
During our outlier analysis for the FDI_2013 indicator, we observed that most values fell within a plausible range (approximately -10% to 30%). However, one observation stood out:
Country Code: CYM (Cayman Islands)
Economic Plausibility: In general, FDI as a percentage of GDP tends to be within a moderate range. An FDI value of 432.86% is highly unusual and inconsistent with known economic conditions.
Domain Knowledge & Metadata Review: A review of domain knowledge and available metadata did not suggest that such an extreme FDI figure would be expected for the Cayman Islands. Instead, this anomaly is likely due to a data-entry error (e.g., a misplaced decimal).
Selective Treatment: Rather than applying a blanket outlier treatment to all flagged records, we decided to remove only the Cayman Islands record because it clearly represents a data error. Other flagged values, including those that are negative (which can be economically meaningful), were retained for further analysis.
We removed the record corresponding to the Cayman Islands from our dataset. This targeted approach helps maintain the integrity of the remaining data while ensuring that our analysis is not skewed by a clear data-entry anomaly.
# Identify the record for Cayman Islands (Country Code: CYM) with an implausible FDI_2013 value
cym_record = df_merged[df_merged["Country Code"] == "CYM"][["Country Code", "FDI_2013"]]
display(cym_record)
# Remove the record from the merged dataset
df_cleaned = df_merged[df_merged["Country Code"] != "CYM"]
# Verify removal
print("Remaining records for Cayman Islands:")
print(df_cleaned[df_cleaned["Country Code"] == "CYM"])
| Country Code | FDI_2013 | |
|---|---|---|
| 52 | CYM | 432.8594 |
Remaining records for Cayman Islands: Empty DataFrame Columns: [Country Name, Country Code, FDI_2013, NPL_2018, PrivCreditBanks_2013, BroadMoney_2005, PrivCredit_FS_2013, Region, IncomeGroup] Index: []
In our analysis of financial indicators, it is crucial to consider the business relevance of detected outliers. Rather than treating all extreme values uniformly, we evaluate whether these outliers reflect important economic events or if they are likely due to errors. Our approach is as follows:
FDI_2013: Outliers in FDI may signal significant events—such as a one-time surge in foreign investment in small economies—that can have major implications for economic performance. For example, an unusually high FDI figure might indicate a major acquisition or a sudden capital inflow, which are directly relevant to understanding investment trends.
NPL_2018: Extreme values in nonperforming loans (NPL) can reflect periods of financial distress or crisis. Given that NPL ratios are central to assessing financial stability, retaining such outliers (if they are genuine) is essential for analyzing systemic risk and guiding policy measures.
Credit Measures (e.g., PrivCreditBanks_2013, PrivCredit_FS_2013): Outliers in these indicators may indicate rapid credit expansion or contraction. These are key to understanding the health of the financial sector and its impact on economic growth.
Retain Meaningful Outliers: If an extreme value appears to be the result of an important economic event (e.g., a sudden surge in FDI due to a major foreign investment deal), we retain the outlier. Such values, although statistically extreme, provide valuable insights into the underlying economic phenomena.
Modify or Remove Peripheral Outliers: If an outlier seems to be due to data entry errors (such as a misplaced decimal) or if it pertains to an indicator that is less central to our core business questions, we may opt to transform (e.g., via winsorization or logarithmic transformation) or remove the outlier.
Business Relevance:
Concrete Decision:
NPL_2018_raw) in a separate column for reference. Rationale:
Business Relevance:
Concrete Decision:
Rationale:
Objective:
Concrete Decision:
Rationale:
Objective:
Concrete Decision:
log1p if there are zeros) or more aggressive winsorization (e.g., capping at the 95th percentile) for key indicators. Rationale:
# List of central indicators that are key to our business questions,
# including FDI_2013 which has not yet been treated.
central_indicators = ["FDI_2013", "NPL_2018", "PrivCreditBanks_2013", "BroadMoney_2005", "PrivCredit_FS_2013"]
# Define winsorization bounds (example: 1st and 99th percentiles)
lower_bound_pct, upper_bound_pct = 0.01, 0.99
# Create a copy of the cleaned DataFrame for further treatment
df_treated = df_cleaned.copy()
for col in central_indicators:
# Calculate bounds for winsorization
lower_bound = df_treated[col].quantile(lower_bound_pct)
upper_bound = df_treated[col].quantile(upper_bound_pct)
# Apply winsorization: cap values outside the bounds
df_treated[col] = df_treated[col].clip(lower_bound, upper_bound)
# Create log-transformed version if values are strictly positive;
# otherwise, use log1p to handle zero or small negatives
if df_treated[col].min() > 0:
df_treated[col + "_log"] = np.log(df_treated[col])
else:
# Add a small constant if necessary
df_treated[col + "_log"] = np.log1p(df_treated[col])
print(f"After winsorization, {col} summary:")
print(df_treated[col].describe(), "\n")
After winsorization, FDI_2013 summary: count 265.000000 mean 4.599880 std 7.085628 min -3.675764 25% 1.644728 50% 2.878637 75% 4.554473 max 47.569901 Name: FDI_2013, dtype: float64 After winsorization, NPL_2018 summary: count 147.000000 mean 7.325855 std 8.658414 min 0.234303 25% 2.237032 50% 3.686425 75% 9.603857 max 47.311912 Name: NPL_2018, dtype: float64 After winsorization, PrivCreditBanks_2013 summary: count 265.000000 mean 49.092047 std 34.688934 min 4.822738 25% 24.350265 50% 41.079745 75% 61.281489 max 168.525823 Name: PrivCreditBanks_2013, dtype: float64 After winsorization, BroadMoney_2005 summary: count 265.000000 mean 50.180365 std 31.469269 min 8.083328 25% 30.762692 50% 44.516882 75% 57.446790 max 170.812273 Name: BroadMoney_2005, dtype: float64 After winsorization, PrivCredit_FS_2013 summary: count 265.000000 mean 53.180171 std 39.043524 min 4.913568 25% 26.493768 50% 42.770132 75% 66.186850 max 184.162922 Name: PrivCredit_FS_2013, dtype: float64
C:\ProgramData\anaconda3\Lib\site-packages\pandas\core\arraylike.py:396: RuntimeWarning: invalid value encountered in log1p result = getattr(ufunc, method)(*inputs, **kwargs)
df_treated.head()
| Country Name | Country Code | FDI_2013 | NPL_2018 | PrivCreditBanks_2013 | BroadMoney_2005 | PrivCredit_FS_2013 | Region | IncomeGroup | FDI_2013_log | NPL_2018_log | PrivCreditBanks_2013_log | BroadMoney_2005_log | PrivCredit_FS_2013_log | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | 9.974455 | NaN | 59.163673 | 57.446790 | 59.163673 | Latin America & Caribbean | High income | 2.395570 | NaN | 4.080308 | 4.050859 | 4.080308 |
| 1 | Africa Eastern and Southern | AFE | 2.342134 | NaN | 38.386182 | 45.229569 | 75.700968 | Unknown | Unknown | 1.206610 | NaN | 3.647698 | 3.811751 | 4.326791 |
| 2 | Afghanistan | AFG | 0.239801 | 8.886660 | 4.822738 | 44.516882 | 4.913568 | South Asia | Low income | 0.214951 | 2.184551 | 1.573342 | 3.795868 | 1.592000 |
| 3 | Africa Western and Central | AFW | 2.255601 | NaN | 12.576012 | 13.594007 | 12.657947 | Unknown | Unknown | 1.180377 | NaN | 2.531791 | 2.609629 | 2.538285 |
| 4 | Angola | AGO | -3.675764 | 23.221737 | 21.341167 | 13.299819 | 42.770132 | Sub-Saharan Africa | Lower middle income | NaN | 3.145089 | 3.060638 | 2.587750 | 3.755840 |
By capping the extreme values, we reduce the undue influence of rare, extreme outliers (for example, the implausibly high FDI value for Cayman Islands that was removed earlier). This is particularly important for descriptive analyses and for calculating mean-based statistics that might otherwise be skewed.
For subsequent predictive or econometric modeling, log-transformed versions of the central indicators are available. These transformations help stabilize the variance and reduce skewness, which is beneficial when the relationships between variables are multiplicative or when outliers might affect model estimates.
Our key research questions focus on understanding financial performance and stability across regions and income groups. Therefore, it is crucial to treat central indicators (such as FDI, NPL, and credit measures) carefully. We preserve these variables in both their winsorized (raw) and log-transformed forms to allow for robust descriptive and predictive analyses.Interpreatation of the results
When addressing key performance and stability questions (like Q1, Q3, and Q4), it’s essential to focus on the most critical indicators—such as NPL (Nonperforming Loans), credit measures, and FDI (Foreign Direct Investment).
Since these indicators are directly related to financial stability, we should apply robust methods like winsorization or log transformation to manage outliers. At the same time, it’s important to keep both the raw and treated versions of the data.
This dual approach allows for domain-specific interpretation, preserving important economic signals (like financial crises or booms) while also stabilizing the model.
For group comparisons (as in Q2), the focus shifts to descriptive analysis. In this context, it might be sufficient to use mild winsorization or even leave the data untreated. The goal here is to maintain the natural variability of the data, which helps make comparisons across regions and income groups more meaningful and accurate.
When it comes to choosing between modeling and descriptive analysis, the approach differs slightly:
Predictive or Econometric Modeling:
Descriptive Analysis:
The main goal of transforming skewed data is to reduce skewness, which helps stabilize variance and makes the data more normally distributed. This is especially useful for statistical models that perform better with data that resembles a normal distribution.
Examine Distributions:
Apply Transformations Where Needed:
# Define the list of numeric columns (financial indicators)
numeric_cols = ["FDI_2013", "NPL_2018", "PrivCreditBanks_2013", "BroadMoney_2005", "PrivCredit_FS_2013"]
# ===============================
# 1. Assess the Need for Scaling
# ===============================
# Print summary statistics of the numeric columns to review range and variance
print("Summary statistics for raw numeric columns:")
print(df_treated[numeric_cols].describe())
# Plot histograms for each numeric column to visualize their distribution
df_treated[numeric_cols].hist(bins=20, figsize=(12, 8))
plt.suptitle("Histograms of Raw Numeric Indicators")
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
Summary statistics for raw numeric columns:
FDI_2013 NPL_2018 PrivCreditBanks_2013 BroadMoney_2005 \
count 265.000000 147.000000 265.000000 265.000000
mean 4.599880 7.325855 49.092047 50.180365
std 7.085628 8.658414 34.688934 31.469269
min -3.675764 0.234303 4.822738 8.083328
25% 1.644728 2.237032 24.350265 30.762692
50% 2.878637 3.686425 41.079745 44.516882
75% 4.554473 9.603857 61.281489 57.446790
max 47.569901 47.311912 168.525823 170.812273
PrivCredit_FS_2013
count 265.000000
mean 53.180171
std 39.043524
min 4.913568
25% 26.493768
50% 42.770132
75% 66.186850
max 184.162922
Standard Scaler After scaling, each numeric column will have approximately mean=0 and standard deviation=1. Check df_scaled_std[numeric_cols].describe() to confirm the new mean and std for each feature.
MinMax Scaler Each numeric column is rescaled such that the minimum value becomes 0 and the maximum value becomes 1 (by default).
Check df_scaled_mm[numeric_cols].describe() to confirm the new min and max for each feature.
# We already defined numeric_cols above
numeric_cols = ["FDI_2013", "NPL_2018", "PrivCreditBanks_2013",
"BroadMoney_2005", "PrivCredit_FS_2013"]
# Create copies of df_treated so we don't overwrite original data
df_scaled_std = df_treated.copy()
df_scaled_mm = df_treated.copy()
# -----------------------------------------
# A) Standard Scaling
# -----------------------------------------
scaler_std = StandardScaler()
df_scaled_std[numeric_cols] = scaler_std.fit_transform(df_scaled_std[numeric_cols])
print("Summary statistics after Standard Scaling:\n")
print(df_scaled_std[numeric_cols].describe())
# -----------------------------------------
# B) MinMax Scaling
# -----------------------------------------
scaler_mm = MinMaxScaler()
df_scaled_mm[numeric_cols] = scaler_mm.fit_transform(df_scaled_mm[numeric_cols])
print("\nSummary statistics after MinMax Scaling:\n")
print(df_scaled_mm[numeric_cols].describe())
Summary statistics after Standard Scaling:
FDI_2013 NPL_2018 PrivCreditBanks_2013 BroadMoney_2005 \
count 2.650000e+02 1.470000e+02 2.650000e+02 2.650000e+02
mean -4.692263e-17 3.625218e-17 -2.681293e-17 -1.072517e-16
std 1.001892e+00 1.003419e+00 1.001892e+00 1.001892e+00
min -1.170158e+00 -8.218361e-01 -1.278594e+00 -1.340250e+00
25% -4.178519e-01 -5.897409e-01 -7.145967e-01 -6.182036e-01
50% -2.433799e-01 -4.217715e-01 -2.314128e-01 -1.803092e-01
75% -6.420414e-03 2.639964e-01 3.520577e-01 2.313423e-01
max 6.075866e+00 4.633962e+00 3.449508e+00 3.840577e+00
PrivCredit_FS_2013
count 2.650000e+02
mean 7.373557e-17
std 1.001892e+00
min -1.238565e+00
25% -6.847973e-01
50% -2.671310e-01
75% 3.337632e-01
max 3.361136e+00
Summary statistics after MinMax Scaling:
FDI_2013 NPL_2018 PrivCreditBanks_2013 BroadMoney_2005 \
count 265.000000 147.000000 265.000000 265.000000
mean 0.161490 0.150635 0.270424 0.258694
std 0.138268 0.183918 0.211902 0.193385
min 0.000000 0.000000 0.000000 0.000000
25% 0.103823 0.042541 0.119286 0.139369
50% 0.127902 0.073328 0.221480 0.223891
75% 0.160604 0.199024 0.344885 0.303348
max 1.000000 1.000000 1.000000 1.000000
PrivCredit_FS_2013
count 265.000000
mean 0.269271
std 0.217817
min 0.000000
25% 0.120392
50% 0.211195
75% 0.341833
max 1.000000
Each feature now has mean ~ 0 and standard deviation ~ 1 (minor floating-point variations are normal).
The minimum and maximum are not fixed at any particular value, but you can see that FDI_2013 extends up to ~6.08 standard deviations above its mean.
Each column now lies strictly between 0 and 1.
min=0 and max=1 for each indicator.
The mean and standard deviation vary, but they are not the focus of this transformation; it’s purely a range-based scaling.
StandardScaler is often preferred if you expect your model or algorithm to assume data is roughly centered around 0 with unit variance (e.g., many gradient-based methods, some clustering approaches).
MinMaxScaler is useful if you want all features in a strict [0, 1] range (e.g., neural networks sometimes prefer bounded inputs, or if you have constraints that require 0 = x = 1).
Tree-based methods (e.g., Random Forest, Gradient Boosting) are generally scale-invariant, so scaling is less critical.
Therefor multiple copies will be kept:
df_treated for raw descriptive analyses or domain-specific interpretability.
df_scaled_std for standard-scaled versions.
df_scaled_mm for min-max-scaled versions.
# Calculate skewness for each numeric indicator
print("Skewness of numeric indicators:")
for col in numeric_cols:
skewness = df_treated[col].skew()
print(f"{col}: {skewness:.2f}")
# Plot histograms to visually inspect distributions
df_treated[numeric_cols].hist(bins=20, figsize=(12, 8))
plt.suptitle("Histograms of Numeric Indicators")
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
# Plot boxplots for a more detailed look at outliers and spread
plt.figure(figsize=(12, 8))
sns.boxplot(data=df_treated[numeric_cols])
plt.title("Boxplots of Numeric Indicators")
plt.xticks(rotation=45)
plt.show()
Skewness of numeric indicators: FDI_2013: 3.90 NPL_2018: 2.61 PrivCreditBanks_2013: 1.31 BroadMoney_2005: 1.59 PrivCredit_FS_2013: 1.37
FDI_2013: Skewness = 3.90 Heavily skewed, but it also contains negative values (minimum around -3.68). A direct log transform is not possible without shifting the data first.
NPL_2018: Skewness = 2.61 Strongly skewed, minimum is positive (0.23), so a log transform is feasible without shifting.
PrivCreditBanks_2013: Skewness = 1.31 Moderately skewed, all values are positive.
BroadMoney_2005: Skewness = 1.59 Moderately skewed, all values are positive.
PrivCredit_FS_2013: Skewness = 1.37 Moderately skewed, all values are positive.
Since all five indicators have skewness above ~1, a transformation can help if you need more “normal-like” distributions—especially for parametric or linear modeling approaches.
FDI_2013 Reason: Highly skewed (skewness ~ 3.90), includes negative values. Approach: Shift + log transform if you want to reduce skewness for modeling. Implementation: Add a small constant to ensure all values are positive, then apply np.log.
NPL_2018 Reason: Strong skewness ( ~ 2.61), all positive. Approach: Direct log transform (or log1p if you prefer to handle very small values gracefully).
PrivCreditBanks_2013, BroadMoney_2005, PrivCredit_FS_2013 Reason: Moderately skewed (1.31–1.59), all positive.
Approach: Log transform is beneficial if you want a more normal-like distribution for regression or other parametric models. If interpretability in original units is crucial, you can keep them raw but consider storing a log version as well.
# 1) FDI_2013: Shift + Log
# Determine a shift to ensure all FDI_2013 values > 0
# Example: since min(FDI_2013) is around -3.68, let's shift by +4.0
df_treated["FDI_2013_shifted"] = df_treated["FDI_2013"] + 4.0
# Apply log transform
df_treated["FDI_2013_log"] = np.log(df_treated["FDI_2013_shifted"])
# 2) NPL_2018: Direct Log (all values are positive)
df_treated["NPL_2018_log"] = np.log(df_treated["NPL_2018"])
# 3) PrivCreditBanks_2013, BroadMoney_2005, PrivCredit_FS_2013:
# Also all positive, so direct log transform
df_treated["PrivCreditBanks_2013_log"] = np.log(df_treated["PrivCreditBanks_2013"])
df_treated["BroadMoney_2005_log"] = np.log(df_treated["BroadMoney_2005"])
df_treated["PrivCredit_FS_2013_log"] = np.log(df_treated["PrivCredit_FS_2013"])
# Optionally, if you prefer log1p for slight safety near zero:
# df_treated["NPL_2018_log"] = np.log1p(df_treated["NPL_2018"])
# ...and so on for other columns.
print("Transformed columns created:")
print(df_treated.columns)
Transformed columns created:
Index(['Country Name', 'Country Code', 'FDI_2013', 'NPL_2018',
'PrivCreditBanks_2013', 'BroadMoney_2005', 'PrivCredit_FS_2013',
'Region', 'IncomeGroup', 'FDI_2013_log', 'NPL_2018_log',
'PrivCreditBanks_2013_log', 'BroadMoney_2005_log',
'PrivCredit_FS_2013_log', 'FDI_2013_shifted'],
dtype='object')
In this step, we create new features that capture additional aspects of financial development and facilitate segmentation. Specifically, we:
Derive an Aggregated "Financial Depth" Score
We combine several credit measures (e.g., credit provided by banks, domestic credit to the private sector, and broad money) into a single composite score.
This score can serve as an overall measure of financial system depth.
Create Ratio Features
Create Interaction Terms or Categorical Features
Region and IncomeGroup variables to create a new feature (e.g., Region_Income) that may help us understand and segment the data better.These engineered features will be used in subsequent analyses (both descriptive and predictive) to explore relationships among indicators, perform segmentation, and better answer our business questions.
# 1. Create an Aggregated "Financial Depth" Score
# Using the raw (winsorized) credit indicators:
# PrivCreditBanks_2013, BroadMoney_2005, and PrivCredit_FS_2013
df_treated["Financial_Depth"] = (
df_treated["PrivCreditBanks_2013"] + df_treated["BroadMoney_2005"] + df_treated["PrivCredit_FS_2013"]
) / 3
# Optionally, if you prefer to use log-transformed values (which might better handle skewness):
df_treated["Financial_Depth_log"] = (
df_treated["PrivCreditBanks_2013_log"] + df_treated["BroadMoney_2005_log"] + df_treated["PrivCredit_FS_2013_log"]
) / 3
# 2. Create Ratio Features
# For example, the ratio of credit provided by banks to domestic credit to the private sector:
df_treated["Credit_Ratio"] = df_treated["PrivCreditBanks_2013"] / df_treated["PrivCredit_FS_2013"]
# 3. Create Interaction Terms or New Categorical Features
# For instance, combine Region and IncomeGroup to create a new categorical feature:
df_treated["Region_Income"] = df_treated["Region"] + "_" + df_treated["IncomeGroup"]
# Display a summary of the new features
print("Financial_Depth summary:")
print(df_treated["Financial_Depth"].describe(), "\n")
print("Credit_Ratio summary:")
print(df_treated["Credit_Ratio"].describe(), "\n")
print("Unique values in Region_Income:")
print(df_treated["Region_Income"].unique())
Financial_Depth summary: count 265.000000 mean 50.817528 std 31.829118 min 6.184821 25% 27.781186 50% 42.788920 75% 64.283902 max 174.500339 Name: Financial_Depth, dtype: float64 Credit_Ratio summary: count 265.000000 mean 0.953492 std 0.190210 min 0.266506 25% 0.960477 50% 0.989426 75% 0.999689 max 3.283636 Name: Credit_Ratio, dtype: float64 Unique values in Region_Income: ['Latin America & Caribbean_High income' 'Unknown_Unknown' 'South Asia_Low income' 'Sub-Saharan Africa_Lower middle income' 'Europe & Central Asia_Upper middle income' 'Europe & Central Asia_High income' 'Middle East & North Africa_High income' 'Latin America & Caribbean_Upper middle income' 'East Asia & Pacific_High income' 'Sub-Saharan Africa_Low income' 'South Asia_Lower middle income' 'North America_High income' 'Latin America & Caribbean_Lower middle income' 'Sub-Saharan Africa_Upper middle income' 'East Asia & Pacific_Upper middle income' 'Middle East & North Africa_Lower middle income' 'Middle East & North Africa_Upper middle income' 'East Asia & Pacific_Lower middle income' 'Europe & Central Asia_Lower middle income' 'South Asia_Upper middle income' 'East Asia & Pacific_Low income' 'Sub-Saharan Africa_High income' 'Middle East & North Africa_Low income' 'Latin America & Caribbean_Unknown']
In this phase, we derive new features from our existing financial indicators to better address our key business questions. The goal is to create variables that can capture the overall financial development of a country, allow for easier segmentation, and potentially enhance model performance. Below, we describe the new features we created and how they relate to our business questions.
Financial_Depth
Definition: An aggregated measure that combines several credit-related indicators.
Rationale: Financial depth is central to understanding the overall development of a country’s financial system. By aggregating measures such as credit to the private sector (from both banks and the financial sector) and broad money, we obtain a single score that can be used for cross-country comparisons and further analysis.
Output Example:
Financial_Depth summary:
count 265.000000
mean 50.817528
std 31.829118
min 6.184821
25% 27.781186
50% 42.788920
75% 64.283902
max 174.500339
Name: Financial_Depth, dtype: float64
Credit_Ratio
Definition: A derived ratio that compares one credit measure to another (for example, credit provided by banks relative to a broader credit measure).
Rationale: This ratio helps us assess whether a country’s credit distribution is skewed toward particular channels and may indicate potential vulnerabilities or strengths in the financial system.
Output Example:
Credit_Ratio summary:
count 265.000000
mean 0.953492
std 0.190210
min 0.266506
25% 0.960477
50% 0.989426
75% 0.999689
max 3.283636
Name: Credit_Ratio, dtype: float64
Region_Income
Definition: A composite categorical variable formed by concatenating the Region and IncomeGroup values (e.g., "Latin America & Caribbean_High income").
Rationale: This feature enables segmentation analysis to explore how financial indicators differ not just by region but by the interaction of region and income level. This is especially useful when addressing questions about regional differences and the influence of economic status.
Output Example:
Unique values in Region_Income:
['Latin America & Caribbean_High income', 'Unknown_Unknown', 'South Asia_Low income',
'Sub-Saharan Africa_Lower middle income', 'Europe & Central Asia_Upper middle income',
'Europe & Central Asia_High income', 'Middle East & North Africa_High income',
... ]
Our decisions to engineer these features are directly aligned with our key business questions:
Economic Performance & Predictive Modeling:
The aggregated measure (Financial_Depth) and the derived ratio (Credit_Ratio) capture key dimensions of financial development. These are critical for understanding overall economic performance and for building predictive models.
Cross-Country Comparisons & Policy Insights:
The composite variable (Region_Income) allows us to segment our analyses by both geographic and economic dimensions, ensuring that policy recommendations can be tailored to specific regional and income group contexts.
In previous steps, we created log-transformed and shifted versions of our numeric indicators to address skewness. While these transformed columns are not part of the newly engineered features, they remain available for:
Robust Modeling: For econometric or predictive models where the assumption of normality is important, the log-transformed indicators (e.g., FDI_2013_log) may be used as inputs.
Further Feature Derivation: They can be combined with other variables or used in creating interaction terms if needed.
By retaining both the raw/aggregated and transformed versions, we ensure that our analyses can address both descriptive and predictive business questions with the most appropriate data representation.
After completing data cleaning, transformations, and feature engineering, the next step is to ensure that the final dataset is fully integrated and validated for further analysis. This involves checking that the merging of primary indicators with country-level metadata has been successful, verifying consistency of key identifiers, and running final validation checks on the dataset.
Country Code). # Final Data Integration and Validation
# Let's assume our final integrated DataFrame (with all raw and transformed features) is df_treated.
# (This DataFrame includes the raw numeric indicators, log-transformed versions, winsorized values,
# and any derived features we have created.)
# 1. Check the final shape and structure of the dataset
print("Final integrated dataset shape:", df_treated.shape)
print("\nDataset info:")
print(df_treated.info())
# 2. Verify that there are no unexpected missing values
print("\nMissing values per column:")
print(df_treated.isnull().sum())
# 3. Review summary statistics for all numeric variables (raw and transformed)
numeric_cols_final = [
"FDI_2013", "NPL_2018", "PrivCreditBanks_2013", "BroadMoney_2005", "PrivCredit_FS_2013",
"FDI_2013_log", "NPL_2018_log", "PrivCreditBanks_2013_log", "BroadMoney_2005_log", "PrivCredit_FS_2013_log"
]
print("\nSummary statistics for numeric variables:")
print(df_treated[numeric_cols_final].describe())
# 4. Validate key identifiers (e.g., Country Code)
unique_country_codes = df_treated["Country Code"].nunique()
total_rows = df_treated.shape[0]
print(f"\nNumber of unique Country Codes: {unique_country_codes} out of {total_rows} rows")
# 5. Visualize distributions of raw and transformed indicators
# Plot histograms for each numeric column
df_treated[numeric_cols_final].hist(bins=20, figsize=(15,10))
plt.suptitle("Histograms of Raw and Transformed Numeric Indicators")
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
# Plot boxplots to inspect the distribution and potential outliers for each numeric column
plt.figure(figsize=(15, 10))
for i, col in enumerate(numeric_cols_final, 1):
plt.subplot(2, 5, i)
sns.boxplot(y=df_treated[col])
plt.title(col)
plt.tight_layout()
plt.show()
Final integrated dataset shape: (265, 19)
Dataset info:
<class 'pandas.core.frame.DataFrame'>
Index: 265 entries, 0 to 265
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country Name 265 non-null object
1 Country Code 265 non-null object
2 FDI_2013 265 non-null float64
3 NPL_2018 147 non-null float64
4 PrivCreditBanks_2013 265 non-null float64
5 BroadMoney_2005 265 non-null float64
6 PrivCredit_FS_2013 265 non-null float64
7 Region 265 non-null object
8 IncomeGroup 265 non-null object
9 FDI_2013_log 265 non-null float64
10 NPL_2018_log 147 non-null float64
11 PrivCreditBanks_2013_log 265 non-null float64
12 BroadMoney_2005_log 265 non-null float64
13 PrivCredit_FS_2013_log 265 non-null float64
14 FDI_2013_shifted 265 non-null float64
15 Financial_Depth 265 non-null float64
16 Financial_Depth_log 265 non-null float64
17 Credit_Ratio 265 non-null float64
18 Region_Income 265 non-null object
dtypes: float64(14), object(5)
memory usage: 41.4+ KB
None
Missing values per column:
Country Name 0
Country Code 0
FDI_2013 0
NPL_2018 118
PrivCreditBanks_2013 0
BroadMoney_2005 0
PrivCredit_FS_2013 0
Region 0
IncomeGroup 0
FDI_2013_log 0
NPL_2018_log 118
PrivCreditBanks_2013_log 0
BroadMoney_2005_log 0
PrivCredit_FS_2013_log 0
FDI_2013_shifted 0
Financial_Depth 0
Financial_Depth_log 0
Credit_Ratio 0
Region_Income 0
dtype: int64
Summary statistics for numeric variables:
FDI_2013 NPL_2018 PrivCreditBanks_2013 BroadMoney_2005 \
count 265.000000 147.000000 265.000000 265.000000
mean 4.599880 7.325855 49.092047 50.180365
std 7.085628 8.658414 34.688934 31.469269
min -3.675764 0.234303 4.822738 8.083328
25% 1.644728 2.237032 24.350265 30.762692
50% 2.878637 3.686425 41.079745 44.516882
75% 4.554473 9.603857 61.281489 57.446790
max 47.569901 47.311912 168.525823 170.812273
PrivCredit_FS_2013 FDI_2013_log NPL_2018_log \
count 265.000000 265.000000 147.000000
mean 53.180171 1.961772 1.449513
std 39.043524 0.616644 1.083988
min 4.913568 -1.126282 -1.451142
25% 26.493768 1.730722 0.805053
50% 42.770132 1.928421 1.304657
75% 66.186850 2.146454 2.262054
max 184.162922 3.942938 3.856762
PrivCreditBanks_2013_log BroadMoney_2005_log PrivCredit_FS_2013_log
count 265.000000 265.000000 265.000000
mean 3.636274 3.734140 3.701176
std 0.763030 0.620503 0.784205
min 1.573342 2.089804 1.592000
25% 3.192543 3.426303 3.276910
50% 3.715515 3.795868 3.755840
75% 4.115478 4.050859 4.192482
max 5.127089 5.140565 5.215821
Number of unique Country Codes: 265 out of 265 rows
After completing data cleaning (including handling missing values, outlier treatment, transformations, and feature engineering), several checks were performed to ensure that the final dataset is well-integrated and ready for analysis.
Successful Merge:
The primary indicators (like FDI, NPL, and credit measures) were merged with country-level metadata (such as Region and IncomeGroup).
Dataset Structure:
FDI_2013, NPL_2018) FDI_2013_log, NPL_2018_log) Financial_Depth, Credit_Ratio, Region_Income) Missing Values:
Raw vs. Transformed Data:
_log). Financial_Depth:
Credit_Ratio:
Region_Income:
df_treated.describe() and other methods (.info(), .isnull().sum()) to check the final dataset. Column Removal:
DomCredit_FS_2017 due to excessive missingness (>80%). Retained Critical Data:
NPL_2018 despite about 45% missing values because of its importance for financial stability analysis. Outlier Handling:
Transformations:
FDI_2013_log, NPL_2018_log) to reduce skewness. Feature Engineering:
The goal is to identify the key financial indicators that drive economic performance and financial stability. In this analysis the focus will beon understanding how these indicators influence a country’s financial system in a way that is relative to its income group. By using a normalized measure—Financial_Depth_z—we can compare countries on a level playing field, thereby better addressing ther business questions on subgroup (regional and income-based) differences.
The objective is to determine which financial indicators (e.g., FDI, NPL, credit measures) have the greatest influence on a country’s financial development and stability. However, instead of modeling absolute differences in financial depth, we now normalize each country’s Financial_Depth within its income group, creating a z-score:
# Compute the z-score for Financial_Depth within each IncomeGroup
df_treated["Financial_Depth_z"] = df_treated.groupby("IncomeGroup")["Financial_Depth"] \
.transform(lambda x: (x - x.mean()) / x.std())
# Set the normalized variable as the target
target_variable = "Financial_Depth_z"
Normalization Within Income Groups By computing the z-score for each country’s Financial_Depth within its respective IncomeGroup, we account for the natural differences between high-income and low-income countries. This allows us to compare countries on an “apples-to-apples” basis, relative to peers with similar income levels.
Focus on Relative Performance Rather than modeling absolute values (which might be dominated by overall income level), the normalized target emphasizes how far above or below a country is from its group’s mean. This perspective directly aligns with our business questions around subgroup differences and provides clearer policy insights for each income category.
Flexibility for Analysis We still retain the raw Financial_Depth measure and its log-transformed version for descriptive purposes or alternative analyses. However, Financial_Depth_z is expected to yield more robust results when we specifically want to isolate which indicators push a country’s financial system beyond what is typical for its income group.
Once we define Financial_Depth_z as our target, it is essential to confirm its distribution and ensure it behaves as intended (mean near zero, standard deviation near one).
# Print summary statistics for Financial_Depth_z
print("Target Variable Summary (Normalized):")
print(df_treated[target_variable].describe())
# Plot the distribution
plt.figure(figsize=(8, 5))
sns.histplot(df_treated[target_variable], bins=20, kde=True)
plt.title(f"Distribution of {target_variable}")
plt.xlabel(target_variable)
plt.ylabel("Frequency")
plt.show()
Target Variable Summary (Normalized): count 2.650000e+02 mean -4.692263e-17 std 9.923953e-01 min -1.476199e+00 25% -7.705420e-01 50% -3.314208e-01 75% 4.995803e-01 max 3.509409e+00 Name: Financial_Depth_z, dtype: float64
Mean ˜ 0, Std ˜ 1 As intended, the z-score transformation centers each income group around zero. Countries with negative scores fall below their group’s average, while those with positive scores exceed it.
Range and Spread Most values lie within -1.48 to +3.51. A few countries display relatively high positive z-scores, suggesting that they significantly outperform their income-group peers in financial depth.
Shape of Distribution The histogram shows a mild right skew, but overall the distribution is considerably tighter than the raw Financial_Depth measure. This indicates the normalization is successfully reducing income-level disparities and focusing on relative financial system performance.
Implications for Modeling Because each country’s financial depth is now expressed in standard deviations from its income-group mean, we can better isolate which indicators help a country “outperform” or “underperform” within its group. This aligns closely with our business questions on subgroup comparisons and provides clearer policy insights for each income category.
Objective:
Identify which financial indicators are most influential in driving economic performance and stability within each income group. Instead of explaining absolute differences in financial depth, our focus is on understanding how countries outperform or underperform relative to their peers. This perspective directly addresses business questions about subgroup (regional and income-based) variations.
Analysis Goals:
Understand Relative Impact:
Target Variable: Financial_Depth_z, a normalized measure that standardizes each country’s Financial_Depth within its income group.
Purpose: Determine how much each financial indicator (e.g., FDI_2013, NPL_2018, credit measures) contributes to explaining relative economic performance—i.e., whether a country is above or below its group’s average.
Evaluate Relationships:
Financial_Depth_z).Predictive Modeling (Optional):
Build models to forecast or simulate scenarios where a country’s financial indicators change.
The focus remains on relative performance within each income category, offering clearer policy insights for under- or over-performing economies.
Key Considerations:
Target Variable Selection:
We use Financial_Depth_z rather than the raw or log-transformed Financial_Depth.
Rationale: By centering each country’s financial depth around its income-group mean, we remove broad income-level effects and concentrate on what drives a country to perform above or below expectations relative to its peers.
Predictors:
In addition to core financial indicators (e.g., FDI_2013, NPL_2018, PrivCreditBanks_2013, BroadMoney_2005, PrivCredit_FS_2013), include relevant categorical variables such as Region and/or the combined feature Region_Income.
Note: Because we have normalized by IncomeGroup, including IncomeGroup itself as a predictor may be redundant—but interactions can still be explored in advanced models.
Modeling Techniques:
Baseline Models:
Advanced Models:
Evaluation Metrics:
Use metrics such as R², RMSE, and MAE to assess model performance. For example, an R² of 0.50 indicates that half the variance in a country’s within-income-group financial depth is explained by the predictors.
By adopting Financial_Depth_z as our target, our modeling framework will:
Financial_Depth_z as the dependent variable.Region_Income and credit measures).By focusing on relative financial performance through Financial_Depth_z, our models will yield insights that are directly actionable for policymakers and stakeholders looking to improve a nation’s standing within its income category.
Decision:
We will use Financial_Depth_z as our target measure. This variable is a normalized composite index that standardizes each country’s Financial_Depth relative to its IncomeGroup. In doing so, we remove the overall income-level effects and focus on how a country performs relative to its peers. This approach directly addresses our revised business questions regarding subgroup (regional and income-based) differences.
Note: Because our target is now Financial_Depth_z, the raw composite measure (Financial_Depth) and its log-transformed version (Financial_Depth_log) are reserved for descriptive purposes only and will not be included in the predictor set.
# Set the normalized variable as the target
target_variable = "Financial_Depth_z"
y = df_treated[target_variable]
# Verify the target variable by printing its summary statistics
print("Target Variable Summary (Normalized):")
print(df_treated[target_variable].describe())
# Plot the distribution of the normalized target variable
plt.figure(figsize=(8, 5))
sns.histplot(df_treated[target_variable], bins=20, kde=True)
plt.title(f"Distribution of {target_variable}")
plt.xlabel(target_variable)
plt.ylabel("Frequency")
plt.show()
Target Variable Summary (Normalized): count 2.650000e+02 mean -4.692263e-17 std 9.923953e-01 min -1.476199e+00 25% -7.705420e-01 50% -3.314208e-01 75% 4.995803e-01 max 3.509409e+00 Name: Financial_Depth_z, dtype: float64
Core Financial Indicators:
To reduce skewness and obtain more normally distributed data, we use the log-transformed versions of our primary indicators. These include:
FDI_2013_log: Log-transformed Foreign Direct Investment as a percentage of GDP.
NPL_2018_log: Log-transformed Nonperforming Loans ratio.
PrivCreditBanks_2013_log: Log-transformed measure of credit provided by banks to the private sector.
BroadMoney_2005_log: Log-transformed measure of Broad Money as a percentage of GDP.
PrivCredit_FS_2013_log: Log-transformed measure of domestic credit to the private sector from the financial sector.
Engineered Feature:
Categorical Feature:
Region_Income: A combined variable that concatenates Region and IncomeGroup, capturing the interaction effects between geography and income status.
Note: Since our target variable is normalized within IncomeGroup, including IncomeGroup directly may be redundant; however, Region_Income remains valuable for subgroup analysis.
initial_numeric_predictors = [
"FDI_2013_log",
"NPL_2018_log",
"PrivCreditBanks_2013_log",
"BroadMoney_2005_log",
"PrivCredit_FS_2013_log",
"Credit_Ratio"
]
Prevents data leakage: You do not want to fit transformations (e.g., PCA, dropping columns due to correlation) on the entire dataset.
Ensures that the test set remains truly unseen, so your model evaluation is realistic.
# Create an initial DataFrame X_all with your numeric predictors
X_all_numeric = df_treated[initial_numeric_predictors]
# Perform a train-test split using the normalized target variable y (Financial_Depth_z)
from sklearn.model_selection import train_test_split
X_train_initial, X_test_initial, y_train, y_test = train_test_split(
X_all_numeric,
y,
test_size=0.2,
random_state=42
)
print("X_train_initial shape:", X_train_initial.shape)
print("X_test_initial shape: ", X_test_initial.shape)
print("y_train shape:", y_train.shape)
print("y_test shape: ", y_test.shape)
X_train_initial shape: (212, 6) X_test_initial shape: (53, 6) y_train shape: (212,) y_test shape: (53,)
# Check for missing values in the training set
print("Missing values in X_train_initial:")
print(X_train_initial.isnull().sum())
# Check for missing values in the testing set
print("\nMissing values in X_test_initial:")
print(X_test_initial.isnull().sum())
Missing values in X_train_initial: FDI_2013_log 0 NPL_2018_log 95 PrivCreditBanks_2013_log 0 BroadMoney_2005_log 0 PrivCredit_FS_2013_log 0 Credit_Ratio 0 dtype: int64 Missing values in X_test_initial: FDI_2013_log 0 NPL_2018_log 23 PrivCreditBanks_2013_log 0 BroadMoney_2005_log 0 PrivCredit_FS_2013_log 0 Credit_Ratio 0 dtype: int64
Based on the output, you have missing values in the NPL_2018_log column (95 in training and 23 in testing). To handle these missing values without leaking any information from the test set into the training process, you should fit an imputer on the training data and then transform both the training and test data using that imputer.
# Instantiate the imputer with a chosen strategy (mean imputation used here)
imputer = SimpleImputer(strategy='mean')
# Fit the imputer on the training set and transform it, preserving the original indices
X_train_imputed = pd.DataFrame(
imputer.fit_transform(X_train_initial),
columns=X_train_initial.columns,
index=X_train_initial.index
)
# Transform the test set using the same imputer, preserving the original test indices
X_test_imputed = pd.DataFrame(
imputer.transform(X_test_initial),
columns=X_test_initial.columns,
index=X_test_initial.index
)
# Verify that there are no missing values in the imputed datasets
print("Missing values in X_train_imputed:")
print(X_train_imputed.isnull().sum())
print("\nMissing values in X_test_imputed:")
print(X_test_imputed.isnull().sum())
Missing values in X_train_imputed: FDI_2013_log 0 NPL_2018_log 0 PrivCreditBanks_2013_log 0 BroadMoney_2005_log 0 PrivCredit_FS_2013_log 0 Credit_Ratio 0 dtype: int64 Missing values in X_test_imputed: FDI_2013_log 0 NPL_2018_log 0 PrivCreditBanks_2013_log 0 BroadMoney_2005_log 0 PrivCredit_FS_2013_log 0 Credit_Ratio 0 dtype: int64
Fitting on the Training Data Only: The imputer is fit on X_train_initial so that it computes the mean (or other statistic) based solely on the training data. This ensures that no information from the test set is used in the imputation process.
Transforming Both Sets: The same imputer is then applied to both the training and test sets. This way, the test data is imputed using the statistics learned from the training data.
Verification: After imputation, we print out the missing values again to ensure that the datasets now have no NaNs.
By applying these steps, you help prevent data leakage during preprocessing, which is critical for a robust evaluation of your model's performance. Once imputation is complete, you can proceed with further transformations or directly move on to model training using X_train_imputed and X_test_imputed.
# Define the list of candidate predictor columns (from our imputed training set)
candidate_predictors = [
"FDI_2013_log",
"NPL_2018_log",
"PrivCreditBanks_2013_log",
"BroadMoney_2005_log",
"PrivCredit_FS_2013_log",
"Credit_Ratio"
]
# Calculate the correlation matrix using the training data (X_train_imputed)
corr_matrix_train = X_train_imputed[candidate_predictors].corr()
# Plot the correlation heatmap for the candidate predictors
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix_train, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix for Candidate Predictors (Training Set)")
plt.show()
FDI_2013_log
Shows a moderate positive correlation with NPL_2018_log (approximately +0.28).
Exhibits negative correlations with the credit-related indicators (ranging from about -0.20 to -0.28).
Overall, these modest correlations suggest that FDI_2013_log provides a distinct dimension of information not fully captured by the other variables.
NPL_2018_log
Also moderately correlated with FDI_2013_log (+0.28).
Negatively correlated with the credit measures (around -0.28 to -0.31).
This indicates it contributes an independent perspective on nonperforming loans relative to both FDI and the credit-based indicators.
PrivCreditBanks_2013_log, BroadMoney_2005_log, and PrivCredit_FS_2013_log
PrivCreditBanks_2013_log and PrivCredit_FS_2013_log are extremely highly correlated (˜ +0.97).
BroadMoney_2005_log also has strong positive correlations with these two (˜ +0.69 to +0.71).
These results confirm significant multicollinearity among the core credit-related variables.
Credit_Ratio
Correlations with other features remain relatively low or moderate (generally under |0.30|).
This indicates it may add unique insights about the composition of credit without duplicating the other measures.
Multicollinearity in Credit Indicators
When features exceed correlation thresholds of about ±0.70, linear models can produce unstable coefficient estimates. Small data variations might lead to large swings in estimated effects.
Although tree-based models are more robust to multicollinearity, highly correlated predictors can still overshadow one another in feature-importance measures.
Overlap with the Target
Because our target (Financial_Depth_z) is derived (in part) from these same credit variables, there is a risk of “double-counting” if we rely heavily on the credit indicators.
While normalizing by income group reduces some direct overlap, it does not fully eliminate the influence of shared components.
Interpreting results carefully—especially in linear models—is essential to avoid inflated performance estimates or misleading policy conclusions.
By recognizing these correlations, we can make more informed decisions about feature selection (e.g., removing or combining highly collinear credit measures) and choose modeling approaches (such as regularization or tree-based methods) that handle collinearity more gracefully.
From the correlation matrix and key observations, we see extremely high correlations among the core credit indicators:
PrivCreditBanks_2013_log and PrivCredit_FS_2013_log (˜ +0.97).
BroadMoney_2005_log also shows strong correlations with these two (˜ +0.69 to +0.71).
Such multicollinearity can destabilize coefficient estimates in linear models and obscure the relative importance of these features in both linear and tree-based approaches. Additionally, since our target (Financial_Depth_z) partly reflects credit measures (albeit normalized by income group), we risk inflating model performance or double-counting the same signal if we include all of these highly correlated predictors without careful consideration.
Feature Dropping or Consolidation
Option A: Drop one of the highly correlated credit variables (e.g., remove either PrivCreditBanks_2013_log or PrivCredit_FS_2013_log) to reduce redundancy.
Option B: Combine correlated features into a single aggregated measure (e.g., take the mean or principal component of PrivCreditBanks_2013_log and PrivCredit_FS_2013_log). This can help capture their shared variance while removing duplication.
Regularization
Lasso (L1) or Elastic Net can help shrink coefficients of redundant predictors toward zero, effectively performing feature selection.
Ridge (L2) reduces coefficient variance but tends to keep correlated features rather than discarding any single one, which can still be useful if you prefer to retain multiple credit measures but control their combined influence.
Principal Component Analysis (PCA)
If you prefer a linear combination of the correlated features, PCA can reduce them to one or two principal components that explain most of the variance.
This approach can be especially helpful if you still want to reflect variations among different credit indicators but avoid the collinearity problem in linear modeling.
Domain-Driven Feature Selection
Since Financial_Depth_z already encapsulates some aspect of overall financial depth, you might choose the single most relevant credit variable (e.g., PrivCreditBanks_2013_log) and drop the rest to avoid redundancy.
Alternatively, keep Credit_Ratio as it appears to provide unique composition information, and select only one broad credit indicator for the main effect.
predictors_reduced = [
"FDI_2013_log",
"NPL_2018_log",
"PrivCreditBanks_2013_log", # keep one core credit measure
"BroadMoney_2005_log",
"Credit_Ratio"
]
# Now your X_train_imputed and X_test_imputed can be filtered to these columns
X_train_reduced = X_train_imputed[predictors_reduced]
X_test_reduced = X_test_imputed[predictors_reduced]
print("Reduced predictor set:\n", predictors_reduced)
Reduced predictor set: ['FDI_2013_log', 'NPL_2018_log', 'PrivCreditBanks_2013_log', 'BroadMoney_2005_log', 'Credit_Ratio']
corr_reduced = X_train_reduced.corr()
plt.figure(figsize=(6, 5))
sns.heatmap(corr_reduced, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix After Dropping PrivCredit_FS_2013_log")
plt.show()
Shows only modest correlations with other predictors (absolute values mostly under 0.30).
Maintains a small positive correlation with NPL_2018_log (~ +0.08) and small negative correlations with the credit measures (e.g., ~ -0.28 with PrivCreditBanks_2013_log).
Interpretation: FDI appears to provide a distinct dimension of information not fully captured by NPL or credit-based variables.
Negatively correlated with both PrivCreditBanks_2013_log (~ -0.31) and BroadMoney_2005_log (~ -0.28).
Essentially uncorrelated with Credit_Ratio (~ +0.05).
Interpretation: This suggests that nonperforming loans add an independent perspective on financial stability relative to both FDI and credit indicators.
Their correlation is now around +0.69, which is still considered a strong positive relationship, but notably less severe than the +0.97 we observed with PrivCredit_FS_2013_log.
Interpretation: By dropping PrivCredit_FS_2013_log, we have substantially reduced the worst-case collinearity among the core credit indicators.
Continues to have relatively low correlations (generally under |0.20|) with the remaining predictors.
Interpretation: This feature likely adds a unique perspective on the composition of credit, without duplicating the information captured by the other measures.
With PrivCredit_FS_2013_log removed, the highest pairwise correlation among credit indicators is now ~ +0.69 (instead of +0.97).
This reduction helps stabilize coefficient estimates in linear models and yields clearer feature-importance results in tree-based methods.
Since our target (Financial_Depth_z) partially derives from credit metrics, removing the most redundant predictor lowers the risk of “double-counting.”
We still must interpret results with caution, but the streamlined feature set should be more robust.
By omitting PrivCredit_FS_2013_log, we reduce the most severe collinearity while preserving a representative mix of FDI, NPL, and credit-related indicators—along with Credit_Ratio. This refined predictor set is better positioned to deliver stable modeling results and clearer insights into which factors truly influence a country’s relative financial depth (Financial_Depth_z).
Although the target variable (Financial_Depth_z) is normalized by IncomeGroup, including the combined categorical feature Region_Income still adds value. This feature captures residual regional effects and the interaction of region and income, which may explain additional variations in relative financial performance.
One-hot encoding Region_Income ensures that any subtle differences among subgroup categories are available as predictors for our models.
# 1. Extract the training and testing portions of the Region_Income feature
region_train = df_treated.loc[X_train_imputed.index, "Region_Income"]
region_test = df_treated.loc[X_test_imputed.index, "Region_Income"]
# 2. One-hot encode the training portion of Region_Income (drop the first category for baseline)
region_dummies_train = pd.get_dummies(region_train, prefix="RI", drop_first=True)
# 3. One-hot encode the test portion
region_dummies_test = pd.get_dummies(region_test, prefix="RI", drop_first=True)
# 4. Align columns between training and test sets in case one set contains categories
# not present in the other. Missing columns are filled with zeros.
region_dummies_train, region_dummies_test = region_dummies_train.align(
region_dummies_test,
join='outer',
axis=1,
fill_value=0
)
# 5. Concatenate these dummy columns to your numeric predictors
X_train_final = pd.concat([X_train_imputed, region_dummies_train], axis=1)
X_test_final = pd.concat([X_test_imputed, region_dummies_test], axis=1)
print("Final train shape:", X_train_final.shape)
print("Final test shape: ", X_test_final.shape)
Final train shape: (212, 29) Final test shape: (53, 29)
print("Final train shape:", X_train_final.shape)
print("Final test shape:", X_test_final.shape)
print("Training columns:", X_train_final.columns.tolist())
print("Testing columns:", X_test_final.columns.tolist())
Final train shape: (212, 29) Final test shape: (53, 29) Training columns: ['FDI_2013_log', 'NPL_2018_log', 'PrivCreditBanks_2013_log', 'BroadMoney_2005_log', 'PrivCredit_FS_2013_log', 'Credit_Ratio', 'RI_East Asia & Pacific_Low income', 'RI_East Asia & Pacific_Lower middle income', 'RI_East Asia & Pacific_Upper middle income', 'RI_Europe & Central Asia_High income', 'RI_Europe & Central Asia_Lower middle income', 'RI_Europe & Central Asia_Upper middle income', 'RI_Latin America & Caribbean_High income', 'RI_Latin America & Caribbean_Lower middle income', 'RI_Latin America & Caribbean_Unknown', 'RI_Latin America & Caribbean_Upper middle income', 'RI_Middle East & North Africa_High income', 'RI_Middle East & North Africa_Low income', 'RI_Middle East & North Africa_Lower middle income', 'RI_Middle East & North Africa_Upper middle income', 'RI_North America_High income', 'RI_South Asia_Low income', 'RI_South Asia_Lower middle income', 'RI_South Asia_Upper middle income', 'RI_Sub-Saharan Africa_High income', 'RI_Sub-Saharan Africa_Low income', 'RI_Sub-Saharan Africa_Lower middle income', 'RI_Sub-Saharan Africa_Upper middle income', 'RI_Unknown_Unknown'] Testing columns: ['FDI_2013_log', 'NPL_2018_log', 'PrivCreditBanks_2013_log', 'BroadMoney_2005_log', 'PrivCredit_FS_2013_log', 'Credit_Ratio', 'RI_East Asia & Pacific_Low income', 'RI_East Asia & Pacific_Lower middle income', 'RI_East Asia & Pacific_Upper middle income', 'RI_Europe & Central Asia_High income', 'RI_Europe & Central Asia_Lower middle income', 'RI_Europe & Central Asia_Upper middle income', 'RI_Latin America & Caribbean_High income', 'RI_Latin America & Caribbean_Lower middle income', 'RI_Latin America & Caribbean_Unknown', 'RI_Latin America & Caribbean_Upper middle income', 'RI_Middle East & North Africa_High income', 'RI_Middle East & North Africa_Low income', 'RI_Middle East & North Africa_Lower middle income', 'RI_Middle East & North Africa_Upper middle income', 'RI_North America_High income', 'RI_South Asia_Low income', 'RI_South Asia_Lower middle income', 'RI_South Asia_Upper middle income', 'RI_Sub-Saharan Africa_High income', 'RI_Sub-Saharan Africa_Low income', 'RI_Sub-Saharan Africa_Lower middle income', 'RI_Sub-Saharan Africa_Upper middle income', 'RI_Unknown_Unknown']
print("Missing values in final training set:\n", X_train_final.isnull().sum())
print("Missing values in final test set:\n", X_test_final.isnull().sum())
Missing values in final training set: FDI_2013_log 0 NPL_2018_log 0 PrivCreditBanks_2013_log 0 BroadMoney_2005_log 0 PrivCredit_FS_2013_log 0 Credit_Ratio 0 RI_East Asia & Pacific_Low income 0 RI_East Asia & Pacific_Lower middle income 0 RI_East Asia & Pacific_Upper middle income 0 RI_Europe & Central Asia_High income 0 RI_Europe & Central Asia_Lower middle income 0 RI_Europe & Central Asia_Upper middle income 0 RI_Latin America & Caribbean_High income 0 RI_Latin America & Caribbean_Lower middle income 0 RI_Latin America & Caribbean_Unknown 0 RI_Latin America & Caribbean_Upper middle income 0 RI_Middle East & North Africa_High income 0 RI_Middle East & North Africa_Low income 0 RI_Middle East & North Africa_Lower middle income 0 RI_Middle East & North Africa_Upper middle income 0 RI_North America_High income 0 RI_South Asia_Low income 0 RI_South Asia_Lower middle income 0 RI_South Asia_Upper middle income 0 RI_Sub-Saharan Africa_High income 0 RI_Sub-Saharan Africa_Low income 0 RI_Sub-Saharan Africa_Lower middle income 0 RI_Sub-Saharan Africa_Upper middle income 0 RI_Unknown_Unknown 0 dtype: int64 Missing values in final test set: FDI_2013_log 0 NPL_2018_log 0 PrivCreditBanks_2013_log 0 BroadMoney_2005_log 0 PrivCredit_FS_2013_log 0 Credit_Ratio 0 RI_East Asia & Pacific_Low income 0 RI_East Asia & Pacific_Lower middle income 0 RI_East Asia & Pacific_Upper middle income 0 RI_Europe & Central Asia_High income 0 RI_Europe & Central Asia_Lower middle income 0 RI_Europe & Central Asia_Upper middle income 0 RI_Latin America & Caribbean_High income 0 RI_Latin America & Caribbean_Lower middle income 0 RI_Latin America & Caribbean_Unknown 0 RI_Latin America & Caribbean_Upper middle income 0 RI_Middle East & North Africa_High income 0 RI_Middle East & North Africa_Low income 0 RI_Middle East & North Africa_Lower middle income 0 RI_Middle East & North Africa_Upper middle income 0 RI_North America_High income 0 RI_South Asia_Low income 0 RI_South Asia_Lower middle income 0 RI_South Asia_Upper middle income 0 RI_Sub-Saharan Africa_High income 0 RI_Sub-Saharan Africa_Low income 0 RI_Sub-Saharan Africa_Lower middle income 0 RI_Sub-Saharan Africa_Upper middle income 0 RI_Unknown_Unknown 0 dtype: int64
print("Data types in final training set:\n", X_train_final.dtypes)
print("Data types in final test set:\n", X_test_final.dtypes)
Data types in final training set: FDI_2013_log float64 NPL_2018_log float64 PrivCreditBanks_2013_log float64 BroadMoney_2005_log float64 PrivCredit_FS_2013_log float64 Credit_Ratio float64 RI_East Asia & Pacific_Low income bool RI_East Asia & Pacific_Lower middle income bool RI_East Asia & Pacific_Upper middle income bool RI_Europe & Central Asia_High income bool RI_Europe & Central Asia_Lower middle income bool RI_Europe & Central Asia_Upper middle income bool RI_Latin America & Caribbean_High income bool RI_Latin America & Caribbean_Lower middle income bool RI_Latin America & Caribbean_Unknown bool RI_Latin America & Caribbean_Upper middle income bool RI_Middle East & North Africa_High income bool RI_Middle East & North Africa_Low income bool RI_Middle East & North Africa_Lower middle income bool RI_Middle East & North Africa_Upper middle income bool RI_North America_High income bool RI_South Asia_Low income bool RI_South Asia_Lower middle income bool RI_South Asia_Upper middle income bool RI_Sub-Saharan Africa_High income bool RI_Sub-Saharan Africa_Low income bool RI_Sub-Saharan Africa_Lower middle income bool RI_Sub-Saharan Africa_Upper middle income bool RI_Unknown_Unknown bool dtype: object Data types in final test set: FDI_2013_log float64 NPL_2018_log float64 PrivCreditBanks_2013_log float64 BroadMoney_2005_log float64 PrivCredit_FS_2013_log float64 Credit_Ratio float64 RI_East Asia & Pacific_Low income int64 RI_East Asia & Pacific_Lower middle income bool RI_East Asia & Pacific_Upper middle income bool RI_Europe & Central Asia_High income bool RI_Europe & Central Asia_Lower middle income bool RI_Europe & Central Asia_Upper middle income bool RI_Latin America & Caribbean_High income bool RI_Latin America & Caribbean_Lower middle income bool RI_Latin America & Caribbean_Unknown int64 RI_Latin America & Caribbean_Upper middle income bool RI_Middle East & North Africa_High income bool RI_Middle East & North Africa_Low income int64 RI_Middle East & North Africa_Lower middle income bool RI_Middle East & North Africa_Upper middle income int64 RI_North America_High income int64 RI_South Asia_Low income int64 RI_South Asia_Lower middle income int64 RI_South Asia_Upper middle income int64 RI_Sub-Saharan Africa_High income int64 RI_Sub-Saharan Africa_Low income bool RI_Sub-Saharan Africa_Lower middle income bool RI_Sub-Saharan Africa_Upper middle income bool RI_Unknown_Unknown bool dtype: object
The information provided by section 5.6 and 5.7 is fully usable for modeling and training. The final preprocessed dataset is complete, has no missing values, and the categorical variables have been properly encoded and aligned. No further revisions or rework are necessary before proceeding to model training.
This setup ensures that your modeling phase will have a robust and consistent dataset for both baseline and advanced modeling approaches.
In this step, we will train our selected models using the final predictor matrix and target variable, and then adjust their hyperparameters using cross-validation.
Our objective is to identify which financial indicators most strongly influence economic performance and stability—measured as relative financial depth within income groups. We now use the normalized target variable, Financial_Depth_z, which represents how far a country’s financial depth deviates (in standard deviations) from the average of its income group. This approach focuses on relative performance, addressing subgroup (regional and income-based) differences directly.
To achieve this, we begin with a baseline linear regression model for its interpretability. The coefficient estimates will tell us how a one-unit change in a predictor (e.g., in log-transformed FDI or credit measures) is associated with a change in Financial_Depth_z (i.e., in standard deviations from the group mean). Later, we will explore advanced models—such as regularized regression (Lasso, Ridge, Elastic Net) and tree-based methods—to capture nonlinear relationships and improve predictive performance.
The train-test split was already performed in Section 5.
print("Final train shape:", X_train_final.shape)
print("Final test shape: ", X_test_final.shape)
print("Target shape (training):", y_train.shape)
print("Target shape (test):", y_test.shape)
Final train shape: (212, 29) Final test shape: (53, 29) Target shape (training): (212,) Target shape (test): (53,)
We now fit a simple linear regression model using our final predictor matrix and normalized target variable. This baseline model serves two main purposes:
# Instantiate the baseline model
lr = LinearRegression()
# Perform 5-fold cross-validation on the training set using negative MSE as the scoring metric
cv_scores = cross_val_score(lr, X_train_final, y_train, cv=5, scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-cv_scores)
print("Cross-validated RMSE scores:", rmse_scores)
print("Mean CV RMSE:", rmse_scores.mean())
# Train the model on the full training set
lr.fit(X_train_final, y_train)
# Predict on the training and test sets
y_train_pred = lr.predict(X_train_final)
y_test_pred = lr.predict(X_test_final)
# Evaluate model performance
print("Training R²:", r2_score(y_train, y_train_pred))
print("Test R²:", r2_score(y_test, y_test_pred))
print("Training RMSE:", mean_squared_error(y_train, y_train_pred, squared=False))
print("Test RMSE:", mean_squared_error(y_test, y_test_pred, squared=False))
Cross-validated RMSE scores: [0.66713984 0.37008364 0.48800311 0.3305429 0.50404137] Mean CV RMSE: 0.4719621726510848 Training R²: 0.8843429080109859 Test R²: 0.8737936773136867 Training RMSE: 0.3371095745551946 Test RMSE: 0.3504940422868276
To further improve our model and account for potential multicollinearity or nonlinear relationships, we will explore the following approaches:
Regularized Regression Models (Lasso, Ridge, Elastic Net): These methods can shrink or zero out coefficients for redundant predictors, helping to address multicollinearity.
Tree-Based Models (Decision Trees, Random Forests, Gradient Boosting): These models capture nonlinear relationships and interactions without the need for extensive preprocessing and provide feature importance metrics.
# Define a Ridge Regression model and a grid of alpha values
ridge = Ridge()
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}
# Set up GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(ridge, param_grid, cv=5, scoring='r2')
grid_search.fit(X_train_final, y_train)
# Output the best hyperparameter and corresponding performance
print("Best Ridge alpha:", grid_search.best_params_['alpha'])
print("Best cross-validated R²:", grid_search.best_score_)
Best Ridge alpha: 1 Best cross-validated R²: 0.781102134314075
# Instantiate a Random Forest model
rf = RandomForestRegressor(random_state=42)
# Define a grid of hyperparameters to search
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 5, 10, 15],
'min_samples_split': [2, 5, 10]
}
# Set up GridSearchCV with 5-fold cross-validation
grid_search_rf = GridSearchCV(rf, param_grid, cv=5, scoring='r2', n_jobs=-1)
grid_search_rf.fit(X_train_final, y_train)
# Output the best hyperparameters and the corresponding cross-validated R² score
print("Best Random Forest parameters:", grid_search_rf.best_params_)
print("Best cross-validated R²:", grid_search_rf.best_score_)
# Train the final Random Forest model using the best parameters
best_rf = grid_search_rf.best_estimator_
y_test_pred_rf = best_rf.predict(X_test_final)
# Evaluate performance on the test set
rf_r2 = r2_score(y_test, y_test_pred_rf)
rf_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred_rf))
print("Random Forest Test R²:", rf_r2)
print("Random Forest Test RMSE:", rf_rmse)
Best Random Forest parameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 200}
Best cross-validated R²: 0.7651823926826363
Random Forest Test R²: 0.8421518264621712
Random Forest Test RMSE: 0.39197634983573215
In this section, we explore an advanced ensemble model—XGBoost—as a next step to further improve our predictive performance. Although our Random Forest model performed well, XGBoost is known to often outperform bagging methods by iteratively correcting errors and capturing complex nonlinear relationships and interactions.
Rationale:
The following code demonstrates how to set up and tune an XGBoost model using GridSearchCV, and then evaluate its performance on the test set.
# Define the XGBoost Regressor
xgb_reg = xgb.XGBRegressor(random_state=42, objective='reg:squarederror')
# Set up a grid of hyperparameters to search
param_grid_xgb = {
'n_estimators': [100, 200, 300],
'max_depth': [3, 5, 7],
'learning_rate': [0.01, 0.1, 0.2],
'subsample': [0.8, 1.0]
}
# Set up GridSearchCV with 5-fold cross-validation using R² as the scoring metric
grid_search_xgb = GridSearchCV(xgb_reg, param_grid_xgb, cv=5, scoring='r2', n_jobs=-1)
grid_search_xgb.fit(X_train_final, y_train)
# Output the best hyperparameters and corresponding cross-validated R² score
print("Best XGBoost parameters:", grid_search_xgb.best_params_)
print("Best cross-validated R² for XGBoost:", grid_search_xgb.best_score_)
# Train the final XGBoost model using the best parameters and evaluate on the test set
best_xgb = grid_search_xgb.best_estimator_
y_test_pred_xgb = best_xgb.predict(X_test_final)
xgb_r2 = r2_score(y_test, y_test_pred_xgb)
xgb_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred_xgb))
print("XGBoost Test R²:", xgb_r2)
print("XGBoost Test RMSE:", xgb_rmse)
Best XGBoost parameters: {'learning_rate': 0.2, 'max_depth': 3, 'n_estimators': 200, 'subsample': 0.8}
Best cross-validated R² for XGBoost: 0.8397349356769614
XGBoost Test R²: 0.9208420230005324
XGBoost Test RMSE: 0.27757952429517563
These results indicate that the XGBoost model outperforms our baseline Random Forest model, achieving a higher R² and lower RMSE on the test set. This suggests that XGBoost is an excellent candidate for capturing the complex relationships in our data and providing robust predictions of relative financial performance (as measured by Financial_Depth_z).
Using the final XGBoost model (best_xgb) trained with X_train_final (which now includes both log-transformed financial indicators and one-hot encoded Region_Income features), we extract and visualize feature importances.
# Get the feature importances from the best XGBoost model
feature_importances = best_xgb.feature_importances_
# Create a DataFrame for easy plotting and interpretation.
# Make sure the feature names match those in X_train_final.
importance_df = pd.DataFrame({
'Feature': X_train_final.columns,
'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)
# Display the feature importance DataFrame
print(importance_df)
# Plot a horizontal bar chart for better readability
plt.figure(figsize=(10, 6))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='skyblue')
plt.xlabel('Importance')
plt.title('XGBoost Feature Importances')
plt.gca().invert_yaxis() # Highest importance on top
plt.show()
Feature Importance 4 PrivCredit_FS_2013_log 0.412706 3 BroadMoney_2005_log 0.125121 25 RI_Sub-Saharan Africa_Low income 0.067518 12 RI_Latin America & Caribbean_High income 0.063836 28 RI_Unknown_Unknown 0.046111 6 RI_East Asia & Pacific_Low income 0.038043 7 RI_East Asia & Pacific_Lower middle income 0.033252 22 RI_South Asia_Lower middle income 0.026395 9 RI_Europe & Central Asia_High income 0.024914 17 RI_Middle East & North Africa_Low income 0.021987 16 RI_Middle East & North Africa_High income 0.020584 2 PrivCreditBanks_2013_log 0.020185 20 RI_North America_High income 0.020022 24 RI_Sub-Saharan Africa_High income 0.013820 13 RI_Latin America & Caribbean_Lower middle income 0.011917 26 RI_Sub-Saharan Africa_Lower middle income 0.010169 18 RI_Middle East & North Africa_Lower middle income 0.006624 0 FDI_2013_log 0.005962 1 NPL_2018_log 0.005469 21 RI_South Asia_Low income 0.005382 5 Credit_Ratio 0.004584 8 RI_East Asia & Pacific_Upper middle income 0.004464 15 RI_Latin America & Caribbean_Upper middle income 0.003523 11 RI_Europe & Central Asia_Upper middle income 0.002863 14 RI_Latin America & Caribbean_Unknown 0.001626 19 RI_Middle East & North Africa_Upper middle income 0.001092 10 RI_Europe & Central Asia_Lower middle income 0.000928 27 RI_Sub-Saharan Africa_Upper middle income 0.000903 23 RI_South Asia_Upper middle income 0.000000
Based on the XGBoost feature importances, Credit_Depth_PC overwhelmingly dominates, suggesting that credit-related variables strongly explain Financial_Depth_log. We also see a meaningful contribution from certain region-income categories, notably RI_South Asia_Low income. If our policy goal is to highlight other levers (beyond improving credit depth), we might experiment with removing or adjusting Credit_Depth_PC and reevaluating feature importances. Alternatively, if maximizing predictive accuracy is paramount, we can retain Credit_Depth_PC and focus on verifying whether additional fine-tuning yields incremental gains.
Although our XGBoost model shows strong predictive performance (Test R²: 0.92, RMSE: 0.28) and our feature importance analysis highlights the dominance of PrivCredit_FS_2013_log and the influence of several Region_Income dummies, we need to gain a deeper understanding of how these features contribute to individual predictions.
sing SHAP (SHapley Additive exPlanations) will allow us to:
# Step 2: Create a SHAP explainer using our best XGBoost model
explainer = shap.TreeExplainer(best_xgb)
# Step 3: Compute SHAP values for the training set (or a representative subset)
shap_values = explainer.shap_values(X_train_final)
# Step 4: Generate a summary plot to visualize global feature importance
shap.summary_plot(shap_values, X_train_final, plot_type="bar")
# Step 5: Generate a detailed summary plot to visualize feature effects and interactions
shap.summary_plot(shap_values, X_train_final)
The SHAP results indicate that even after normalizing financial depth by income group (via the Financial_Depth_z target), several of the one-hot encoded Region_Income dummy variables remain highly influential. This finding suggests that there are important regional or subgroup effects beyond what is captured by income normalization alone. For example, certain regions may consistently outperform or underperform their income peers due to differences in policy environments, regulatory frameworks, or institutional quality.
To better understand these nuances, our next step is to investigate why specific regions deviate from the norm. Two potential approaches include:
Partial Dependence Plots (PDPs):
PDPs can help visualize how changes in one or two key predictors (for instance, PrivCredit_FS_2013_log and BroadMoney_2005_log) affect the predicted relative financial depth while averaging out the effects of other variables. This will allow us to see whether the interaction between credit supply and money supply has a multiplicative effect on performance relative to the income-group average.
SHAP Interaction Plots:
By exploring SHAP interaction values, we can quantify and visualize how pairs of features (e.g., the interaction between PrivCredit_FS_2013_log and BroadMoney_2005_log) jointly influence the model's output. This method provides a deeper understanding of whether the combined effect of these credit measures is more (or less) than the sum of their individual impacts.
Investigating these interactions will not only refine our model’s interpretability but will also yield actionable insights. For instance, if the joint effect of high private sector credit and ample broad money supply drives a country significantly above its income-group average, this could inform targeted policy interventions to strengthen financial development in underperforming regions. The next step is to generate these plots and analyze the interactions further.
# =============================================================================
# Domain-Based Interpretation: Partial Dependence Plots
# =============================================================================
# 1. Single Feature Partial Dependence:
features_single = ["PrivCredit_FS_2013_log"]
PartialDependenceDisplay.from_estimator(
best_xgb, # the best XGBoost model from earlier
X_train_final, # training data (including numeric and one-hot encoded features)
features=features_single,
feature_names=X_train_final.columns,
kind="average" # 'average' shows the partial dependence curve
)
plt.title("Partial Dependence for PrivCredit_FS_2013_log")
plt.show()
# 2. Two-Feature Partial Dependence:
# Investigate the interaction between PrivCredit_FS_2013_log and BroadMoney_2005_log.
features_pair = [("PrivCredit_FS_2013_log", "BroadMoney_2005_log")]
PartialDependenceDisplay.from_estimator(
best_xgb,
X_train_final,
features=features_pair,
feature_names=X_train_final.columns,
kind="average" # you could also use "both" to show ICE curves along with the average
)
plt.title("2D Partial Dependence: PrivCredit_FS_2013_log vs. BroadMoney_2005_log")
plt.show()
# =============================================================================
# Domain-Based Interpretation: SHAP Dependence and Interaction Plots
# =============================================================================
# Create a SHAP explainer based on the best XGBoost model
explainer = shap.TreeExplainer(best_xgb)
shap_values = explainer.shap_values(X_train_final)
# 1. SHAP Dependence Plot for a Single Feature:
# This shows how the value of PrivCredit_FS_2013_log impacts predictions.
shap.dependence_plot(
"PrivCredit_FS_2013_log", # feature to plot
shap_values, # computed SHAP values for the training set
X_train_final, # training data
feature_names=X_train_final.columns
)
# 2. SHAP Dependence Plot with Interaction:
# Here, we explore the interaction effect between PrivCredit_FS_2013_log and BroadMoney_2005_log.
shap.dependence_plot(
"PrivCredit_FS_2013_log",
shap_values,
X_train_final,
feature_names=X_train_final.columns,
interaction_index="BroadMoney_2005_log" # shows how this feature moderates the effect of PrivCredit_FS_2013_log
)
The partial dependence plots provide insight into how changes in specific predictors affect the normalized target variable (Financial_Depth_z). For example, the plot for PrivCredit_FS_2013_log shows the average effect of changes in domestic credit to the private sector (from the financial sector) on the relative financial depth:
Observation: As the value of PrivCredit_FS_2013_log increases, the predicted Financial_Depth_z also increases. This indicates that higher domestic credit (in log-scale) is associated with a country performing above its income-group average in financial depth.
Interpretation: This suggests that the ability of a country to extend domestic credit to the private sector is a critical factor in surpassing the typical financial system development for its income level.
When examining the 2D partial dependence plot for the interaction between PrivCredit_FS_2013_log and BroadMoney_2005_log:
Observation: The plot shows how the combined levels of private sector credit and broad money influence the target variable. For instance, it may reveal that when BroadMoney_2005_log is high, the effect of increasing PrivCredit_FS_2013_log on Financial_Depth_z might be amplified (or possibly dampened).
Interpretation: These interactions can hint that countries with both a high level of broad money supply and high domestic credit to the private sector tend to significantly outperform their income peers in financial depth.
SHAP dependence plots further enhance our understanding:
Single-Feature SHAP Plot: The SHAP plot for PrivCredit_FS_2013_log confirms its dominant role. By observing the spread of SHAP values, we see how incremental increases in this feature shift predictions upward (i.e., toward a higher Financial_Depth_z). This provides a direct, interpretable measure of the feature's influence.
Interaction Plot: When setting interaction_index="BroadMoney_2005_log", the plot illustrates how the effect of PrivCredit_FS_2013_log on Financial_Depth_z varies with the level of BroadMoney_2005_log. For example, if countries with higher BroadMoney_2005_log show steeper SHAP value gradients, this would suggest an amplified impact on relative financial performance.
Furthermore, the feature importance analysis reveals that several Region_Income dummy variables hold non-negligible importance. This indicates that even after normalization by income group, regional effects persist—implying that structural, policy, or institutional factors tied to geography play a role in relative financial development.
We examine variables such as FDI_2013_log, NPL_2018_log, and various credit measures (e.g., PrivCredit_FS_2013_log, BroadMoney_2005_log).
FDI_2013_log
Definition: The logarithm of foreign direct investment (FDI) as a percentage of GDP in 2013. This transformation reduces skewness and helps capture external capital inflows, which reflect a country’s attractiveness to foreign investors.
NPL_2018_log
Definition: The logarithm of bank nonperforming loans (NPL) as a percentage of total loans in 2018. It measures asset quality and credit risk in the banking sector, with the log transformation helping to normalize the distribution.
PrivCredit_FS_2013_log
Definition: The logarithm of domestic credit provided to the private sector by the financial sector in 2013. This is a key indicator of how effectively a country’s financial system supports private sector credit needs. Our analysis reveals that this variable is especially dominant, suggesting it is a major driver of relative financial system development.
BroadMoney_2005_log
Definition: The logarithm of broad money supply as a percentage of GDP in 2005. This variable captures the overall liquidity in the economy and reflects the scale of financial intermediation.
By normalizing Financial_Depth within income groups—creating Financial_Depth_z—we isolate the effect of each predictor relative to its peers.
Financial_Depth_z
Definition: A normalized (z-scored) version of the composite measure Financial_Depth, calculated within each income group. It indicates how many standard deviations a country’s overall financial development deviates from the income-group average, enabling “apples-to-apples” comparisons.
The inclusion of one-hot encoded Region_Income dummies further highlights that even after income normalization, regional effects persist.
Region_Income
Definition: A composite categorical variable that combines a country’s region and its income group. It is one-hot encoded to capture the interaction effects between regional context and income level, facilitating more detailed subgroup comparisons.
For example, countries in certain regions (e.g., Sub-Saharan Africa or Latin America) show systematic deviations from their income-group norms, which may be due to structural or policy differences.
We developed composite measures like Financial_Depth (and its log-transformed and normalized versions) to capture overall financial development.
Financial_Depth
Definition: A composite index that aggregates several credit-related indicators (such as credit measures and money supply). Its normalized form, Financial_Depth_z, is particularly useful for comparing countries on a level playing field across different income groups.
Our feature importance analysis and SHAP visualizations indicate that credit measures (especially PrivCredit_FS_2013_log) and money supply (e.g., BroadMoney_2005_log) are strong predictors of relative performance.
This suggests that policies aimed at strengthening the domestic credit environment and ensuring stable money supply management may be crucial.
Furthermore, persistent regional effects imply that tailored reforms—such as improving regulatory frameworks in regions that underperform relative to their peers—could yield significant benefits.
Dominant Role of Credit Measures:
PrivCredit_FS_2013_log exhibits very high importance (~0.41), underscoring that the ability to extend private sector credit from the financial sector is a critical driver of superior financial depth.
BroadMoney_2005_log follows, suggesting that the availability of broad money also plays a significant role.
Regional/Subgroup Effects:
Moderate Influence of Other Predictors:
Partial Dependence of PrivCredit_FS_2013_log:
The partial dependence plot for PrivCredit_FS_2013_log indicates that, on average, increases in this variable are associated with higher Financial_Depth_z.
When combined with BroadMoney_2005_log in a 2D plot, the interaction reveals that the effect of private sector credit on relative financial depth may be amplified (or moderated) depending on the level of broad money supply.
Based on these results, several hypotheses and policy implications emerge:
Hypothesis 1:
Higher values of PrivCredit_FS_2013_log drive a country’s performance above its income-group average, likely reflecting more efficient credit extension and better-regulated financial systems.
Policy Implication: Enhance mechanisms that support the expansion and prudent management of private credit.
Hypothesis 2:
The interaction between PrivCredit_FS_2013_log and BroadMoney_2005_log indicates that a robust money supply can magnify the positive effects of credit expansion.
Policy Implication: Ensure monetary policies promote sufficient liquidity without inducing inflation, thereby stabilizing credit flows.
Hypothesis 3:
Significant regional effects (via Region_Income dummies) suggest that structural and institutional factors unique to regions play a crucial role.
Policy Implication: Tailor regulatory and institutional reforms to address the specific challenges of underperforming regions. For example, countries in “RI_Sub-Saharan Africa_Low income” might benefit from targeted improvements in financial oversight and access to capital markets.
In this notebook, we set out to answer several key questions about how different financial indicators influence a country’s economic performance and stability. To do this, we developed a composite measure of financial development and then normalized it within income groups (creating the variable Financial_Depth_z). This normalization allows us to compare countries on a “level playing field” by showing how many standard deviations each country’s financial development is above or below the average for countries with a similar income level.
By converting our raw composite measure into a normalized z-score (Financial_Depth_z), we removed the broad effects that come simply from being a high- or low-income country. Instead, we focused on how a country performs relative to its peers.
Our analysis shows that the log-transformed domestic credit provided to the private sector (PrivCredit_FS_2013_log) is by far the most influential predictor, with an importance score of about 0.41. This indicates that how well a country can extend credit to private businesses is a major factor in exceeding the typical financial development of its income group. The measure of broad money supply (BroadMoney_2005_log), which reflects the overall liquidity or available money in an economy, also plays a significant role.
While the log-transformed measures for foreign direct investment (FDI_2013_log) and bank nonperforming loans (NPL_2018_log) provide additional insights—by capturing aspects like foreign investment and credit risk—their individual contributions are smaller compared to the dominant credit measures.
We also included a feature that combines a country’s region and income level (Region_Income). Even after normalizing for income, the one-hot encoded dummies for this variable still add predictive power. This suggests that regional factors—such as differences in regulatory environments or institutional quality—affect financial development beyond what is captured by income level alone.
Our baseline linear regression model, which is easy to interpret, performed well. More advanced ensemble methods like Random Forests and especially XGBoost further improved performance. For instance, our best XGBoost model achieved a test R² of 0.92 and an RMSE of 0.28, indicating robust predictive power.
The strong influence of PrivCredit_FS_2013_log suggests that a country’s ability to provide credit to its private sector is crucial. In simple terms, if banks and other financial institutions can lend more effectively to businesses, this tends to lift a country’s financial performance above what would be expected based on its income group. Policies that improve lending practices and strengthen financial regulation could therefore be beneficial.
The significance of BroadMoney_2005_log indicates that the overall supply of money in an economy is important. Stable and adequate liquidity (the availability of money) can support business activity and credit flow. Thus, policymakers might focus on managing the money supply carefully to avoid issues like inflation while still providing enough capital for growth.
The persistent predictive power of the Region_Income variable suggests that geographic factors matter. Some regions consistently perform better or worse than others, which might be due to differences in policy, regulation, or institutional strength. This finding implies that reforms should be tailored to regional challenges—for example, strengthening financial oversight in regions that underperform relative to their peers.
Some variables (such as NPL_2018_log) required imputation because of missing values. Future work might explore more advanced imputation techniques or additional data sources to improve the analysis.
While our advanced models (especially XGBoost) performed very well, further exploration of interaction effects and other ensemble methods might uncover even more nuanced relationships.
In summary, our analysis shows that a normalized composite measure of financial development (Financial_Depth_z) effectively captures the nuanced impact of various financial indicators. The dominant role of credit measures, along with persistent regional effects, provides clear, actionable insights that can guide policy reforms aimed at enhancing financial stability and performance.